📊 Benchmark & performance —
| N | fused (ns/elem) | NumPy 2.4.2 (ns/elem) | NumSharp speedup |
|---|---|---|---|
| 64 | 0.547 | 4.322 | 7.9× |
| 4,096 | 0.549 | 0.660 | 1.20× |
| 262,144 | 0.557 | 1.419 | 2.55× |
The kernel is size-invariant (~0.55 ns/elem at every size) while NumPy degrades 2–6× as data spills out of cache.
All 11 ops on this path — speedup vs NumPy @262K (f64):
abs 3.37× negate 3.15× floor 3.07× trunc 3.03× round 3.00×
sqrt 2.55× rad2deg 2.41× deg2rad 2.22× square 2.18× reciprocal 1.72×
Verified 22,000 bit-exact checks (fused == contiguous kernel); full unit suite 9447/0/11.
Note: this is a `DirectILKernelGenerator` whole-array kernel that bypasses NpyIter by design. The fusion (gather folded into `Vector.Create`) is incompatible with NpyIter's gather/kernel separation, which is exactly the (slower) buffered path it replaces.
2. Official NumSharp-vs-NumPy benchmark (6038990f)
Methodology: BenchmarkDotNet Full — 50 iterations, InProcessEmit toolchain, iteration-time capped at 25 ms — × {1K / 100K / 10M} vs NumPy 2.4.2. i9-13900K · .NET 10.0.101 · Python 3.12.12. 1,813 C# measurements → 1,111 matched comparisons.
The iteration-time cap is what makes a Full run feasible: BDN's default Throughput strategy ramps to ~8192 invocations/iteration, so a 10M-element op at 50 iters took ~25 s per case. Capping it ⇒ ~15× faster (a 30-case set went 18 min → 70 s) with all 50 iterations preserved.
Headline — geomean (NumSharp ÷ NumPy, lower = better):
slower ◄───────── 1.0 (parity) ─────────► faster
1K ████████████████████ 1.96× (102 win / 212 lose)
100K ██████████████████▎ 1.83× (109 win / 196 lose)
10M ██████████▏ ........ 1.00× (166 win / 36 lose) ◄ PARITY
At the memory-bound 10M size NumSharp is at parity across ~409 ops (166 faster, only 36 slower). Small-size cost is the per-element dispatch + result-allocation tax (~2×).
Per-suite geomean by size:
| suite | 1K | 100K | 10M |
|---|---|---|---|
| Statistics | 0.19× | 0.68× | 0.48× ✅ |
| Sorting | 0.41× | 1.13× | 0.45× ✅ |
| Comparison | 1.27× | 2.22× | 0.50× ✅ |
| Bitwise | 8.16× | 1.16× | 0.61× ✅ |
| Reduction | 0.48× | 0.94× | 0.91× ✅ |
| Arithmetic | 3.09× | 2.62× | 1.25× 🟡 |
| Unary | 3.50× | 4.44× | 1.53× 🟡 |
| Creation | 12.26× | 2.92× | 2.24× 🟠 |
| LinearAlgebra | 2.76× | 1.66× | 4.02× 🔴 |
🏆 Biggest wins (@10M, real ms):
| op | dtype | NumPy | NumSharp | speedup |
|---|---|---|---|---|
| average | f32 | 9.60 | 0.94 | 10.2× |
| nansum | f32 | 14.35 | 1.49 | 10.0× |
| nanprod | f32 | 18.52 | 1.90 | 9.7× |
| var | f32 | 16.96 | 2.60 | 6.5× |
| count_nonzero | f64 | 22.61 | 3.74 | 6.0× |
| nanmean | f64 | 33.47 | 5.69 | 5.9× |
🎯 Biggest gaps (@10M), the optimization targets:
| op | dtype | NumPy | NumSharp | gap |
|---|---|---|---|---|
| sum axis=1 | uint8 | 3.12 | 49.74 | 16.0× |
| dot | f64 | 1.23 | 16.46 | 13.4× |
| matmul | f64 | 0.72 | 4.26 | 5.9× |
| argsort | int32 | 369 | 2162 | 5.9× |
→ three fronts: narrow-int axis reductions (no widening-SIMD), linear algebra (no BLAS), sort.
Per-dtype @10M (geomean):
int64 0.91 uint64 0.92 f32 0.93 f64 0.98 uint8 1.00 uint32 0.99 ◄ strong
int32 1.11 int16 1.14 uint16 1.24 bool 1.60 ◄ weak (bool, narrow-uint)
Dtype coverage: 10 dtypes compared vs NumPy; char/decimal measured but have no NumPy peer (C#-only). SByte/Half/Complex were uncovered and have since been added to the benchmark code (48e85528) — the next run produces the full 15-dtype matrix.
Reproducibility
- Reusable cross-platform runner: `python benchmark/run_benchmark.py` (builds C#, runs BDN per-suite, sweeps NumPy at 3 sizes, merges, archives).
- Full report: `benchmark/benchmark-report.md` (1,311 rows).
- Provenance snapshot keyed by date+hash: `benchmark/history/2026-06-05_6038990f/` (manifest + report + NumPy timings).
…tier; AV→NA; one CI
Folds the NDIter benchmark into the official orchestrator so there is ONE entry
point and ONE report, while keeping the two harnesses distinct (they measure
different things — op/dtype/N throughput vs the iterator machinery — and the
NDIter harness needs internal access + section-isolation the BenchmarkDotNet
in-process run can't give).
run_benchmark.py — after the official (op,dtype,N) merge, runs the NDIter sheet
+ cards and APPENDS the sheet to benchmark-report.md as its own section (not
merged — different result model). Archives nditer_results.{md,tsv} + cards into
results/<ts>/. New --skip-nditer flag. This is now the single command for the
whole NumSharp-vs-NumPy comparison.
+10M tier (decision 1): nditer_bench.{cs,py} sweep now scalar/1K/100K/1M/10M
(grid 2500x4000 = 10M exactly; pick 30 iters/3 rounds at 10M). sheet TIERS +
cards pick it up automatically.
AV → NA/IGNORED (decision 3): instead of silently omitting a section that
crashes all retries, the sheet now records its ids NA (NumPy runs first to give
the expected id set), prints an AV-POLICY header explaining the known
intermittent AccessViolation is ignored, lists 'THIS RUN: NA across <sections>',
shows NA cells in the per-family/dividends matrices, and excludes NA from every
geomean. tsv stores NA; load/cards skip it.
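The NA handling described above can be sketched roughly as follows. This is an illustration only, not the sheet's actual code: the function name `geomean_skipping_na`, the `"NA"` sentinel string, and the sample ratios are all assumptions made for the example.

```python
import math

NA = "NA"  # assumed sentinel for an id whose section crashed on every retry

def geomean_skipping_na(ratios):
    """Geometric mean of the numeric ratios; NA cells are excluded, not counted as 1.0."""
    values = [r for r in ratios if r != NA]
    if not values:
        return None  # an all-NA section has no geomean to report
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical family with one crashed id.
cells = [2.0, 0.5, NA, 4.0]
result = geomean_skipping_na(cells)
measured = sum(c != NA for c in cells)
print(f"{result:.2f}x geomean over {measured} measured cells")
```

Excluding NA rather than substituting a neutral 1.0 keeps a crashed section from pulling the headline toward parity.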
CI consolidation (decision 2): nditer-benchmark.yml -> benchmark.yml, now runs
the ENTIRE suite via run_benchmark.py. Trigger changed from workflow_run-on-
every-build to release:published (the real 'after a successful release' signal —
'Build and Release' publishes a GitHub Release on a v* tag) + workflow_dispatch,
so the heavy full suite runs per-release, not per-push. Commits report + cards
to master with [skip ci]. timeout-minutes: 180.
The npyiter_parity_poc gather kernels and the rest of the harness methodology
(Release-only, matched kernels, positive-not-copyto, section isolation) are
unchanged.
…n selection
Refreshes the canonical NDIter results (nditer_results.md/.tsv) and the two README cards
with a full sweep that now includes the 10M cache tier, and records the AV->NA policy
firing on a real run. Also documents the run_benchmark.py integration in
benchmark/CLAUDE.md.
What changed
------------
* 198 measured pairs (was 162), 35 of them NA. The new 10M tier adds 36 ids across the
size-swept families; SIZES is now scalar/1K/100K/1M/10M end to end (bench .cs + .py
grids: 10M = 2500x4000).
* selection (where / a[mask] / a[mask]= / count_nz / argwhere / a[idx] / a[idx]=) hit
NumSharp's known intermittent AccessViolation on EVERY retry this run, so the whole
section is reported NA/IGNORED per policy and excluded from every geomean. The header
now reads "198 measured pairs (35 NA)" and "AV POLICY ... THIS RUN: NA across
selection."; the section renders as "(no data)" / "-" / "NA" cells instead of crashing
the sweep. This is the designed crash-resilience path proven on a live run, not a
regression.
* Headline operation matrix: 1.17x geomean, 77 win / 53 lose over 130 cells (26
non-selection families x 5 tiers). Reductions lead (1.80x), dtypes 1.59x, elementwise
1.12x; copy/cast (0.65x) and index-math (0.70x) remain the small-N laggards already
tracked as canaries.
Doc
---
benchmark/CLAUDE.md run_benchmark.py section now describes the appended NDIter step
(aspect x tier, appended-not-merged, section-isolated, AV->NA, --skip-nditer) and points
at benchmark/nditer/README.md, so the dev guide matches the wired-in integration
(run_benchmark.py + benchmark.yml).
Known bug surfaced (tracked, not fixed here)
--------------------------------------------
The selection-section AccessViolation (0xC0000005) is an unmanaged-storage lifetime bug
in NumSharp under heavy mixed alloc/free load. It is intermittent (~50% per heavy
section) and uncatchable; the benchmark now degrades to NA rather than masking it. Worth
a dedicated issue + fix pass.
…ted report artifacts
Adds docs/website-src/docs/benchmarks.md — the DocFX page the user asked for:
"the real place where we discuss and present the efforts to surpass NumPy
through the power of Runtime IL Generation." It is the evidence companion to the
existing IL Generation page (il-generation.md explains HOW the kernels are
emitted; this page shows WHAT that buys head-to-head against NumPy).
The page is driven by the artifacts the Benchmark workflow (benchmark.yml)
auto-commits to master after every release:
* The two 400x300 cards are embedded by absolute raw.githubusercontent master
URLs (same source the README uses), so they always reflect the latest
committed run rather than a pasted screenshot. Verified the docfx build keeps
the URLs absolute (it does not relativize external links).
* The full reports are linked on master: the iterator sheet
(benchmark/nditer/nditer_results.md, which the cards render from) and the
op/dtype/N matrix (benchmark/benchmark-report.md), plus the harness README and
benchmark/CLAUDE.md.
Content (grounded in the current committed nditer_results.md numbers):
* Headline cards + a by-class geomean table (reductions ~1.8x, dtypes ~1.6x,
elementwise ~1.1x parity, copy/cast ~0.65x, index-math ~0.7x).
* Class-by-class discussion tying each result to the IL mechanism (4x unrolling,
tree reduction, SIMD early-exit, per-(op,dtype,layout) specialization), and
honest about the taxes (small-N copy/cast, all-false any() scan, bcast_reduce).
* The dividends NumPy can't structurally match: expression fusion (np.evaluate,
up to ~13x), kernel reuse, parallel inner loop (par8 up to ~8x), cheaper
iterator construction (~2-3x vs np.nditer).
* Methodology + honesty section: Release-only JIT, best-of-rounds, ratios-not-
absolutes, and the AV->NA policy.
* Reproduce-locally commands.
Wiring:
* docs/toc.yml — new "Benchmarks vs NumPy" entry right after IL Generation.
* il-generation.md — cross-link from the Performance Impact section ("naive C#"
table vs the head-to-head-NumPy page).
* index.md — added IL Generation + Benchmarks links to Get Started.
Validated with `docfx build` (build-only, metadata skipped): 0 errors, the page
itself emits 0 warnings (the 84 UidNotFound warnings are api/toc.yml entries that
only resolve after the metadata step, which CI runs first). benchmarks.html
renders, cards resolve to absolute URLs, internal links rewrite to .html.
Note: deploy is via docs.yml on push to master (paths: docs/website-src/**); this
branch commit does not deploy until merged. How the page REFERENCES the
auto-committed cards (raw-master URL vs bundling copies into website-src/images/)
is the next thing to settle.
…FX site
Two follow-ups to the Benchmarks vs NumPy page, both from user direction.
1) The two 400x300 cards now carry the whole canonical summary (modeled on the
ASCII sheet the user singled out), not just one bar chart each. Everything is
still COMPUTED from nditer_results.tsv, so the cards auto-update each run and
NA (AccessViolation) ids are skipped.
* cards/ops.png — OPERATIONS vs NumPy: headline (geomean / win-lose / cells)
+ by-array-size-tier bars (scalar..10M) + by-operation-class bars ranked
best->worst (reductions 1.80x ... copy/cast 0.65x; wins green, the two
small-N taxes red).
* cards/cat.png — the IL-GENERATION DIVIDENDS, the "machinery NumPy has no
equivalent for": iterator build vs np.nditer, expression fusion (np.evaluate),
kernel reuse, parallel inner loop — each bar is the honest geomean with an
"up to <peak>x" annotation — plus the chunk-width trend (w=4 -> w=1024) and
the honest pathology canary (bcast_reduce ~52x behind, in red).
nditer_cards.py rewritten: shared hbars() helper, color_of() (green/amber-
parity/red), stat() for (geomean, peak), two card builders. Imports CTOR/CW/
PATH/DIVIDENDS from the sheet so the section data stays single-sourced.
Captions/alt-text updated to match the new card semantics (cat.png is no longer
"by op class") in README.md and benchmarks.md.
2) Full reports are now rendered INTO the site as searchable pages (user choice:
"Render into the site"), in addition to being linked on GitHub:
* docs/website-src/docs/benchmark-matrix.md — the op/dtype/N matrix
(benchmark-report.md body under a single page H1).
* docs/website-src/docs/benchmark-iterator.md — the canonical iterator sheet
(nditer_results.md fenced block under a page H1).
* toc.yml nests both under "Benchmarks vs NumPy"; benchmarks.md "Read the full
reports" now links the on-site pages (raw files still linked on master).
benchmark.yml regenerates these two pages from the just-produced reports (op
matrix drops its own H1 via tail -n +2 so the page has one title; the iterator
sheet has no H1), commits them alongside the report + cards, and — because the
commit carries [skip ci] and the pages live under docs/website-src/** — then
`gh workflow run docs.yml` to redeploy the site (added actions:write + GH_TOKEN).
Validation
----------
* nditer_cards.py renders both cards; verified visually (legible at 400x300).
* benchmark.yml is valid YAML (yaml.safe_load).
* docfx build (build-only): 0 errors; benchmark-matrix.html + benchmark-iterator.html
generate; benchmarks.html internal links to both resolve; no warning names any new
page (the 82 UidNotFound warnings are api/toc.yml, resolved by the metadata step CI
runs first). No docs/website/ build-output committed.
Still open (deferred by the user): the card REFERENCING mechanism on the docs page
(raw-master URLs today vs bundling the PNGs into website-src/images/). The redeploy
chaining added here would make that swap trivial if chosen later.
… 15 Best"
The op/dtype/N matrix report (benchmark-report.md, rendered into the site as
benchmark-matrix.md) showcased garbage: every "Top 15 Best" row was np.copy(float64) and
np.searchsorted at "0.0 / 0.0x". Three distinct bugs, all fixed.
BUG 1: searchsorted benchmark measured nothing (both sides)
SortingBenchmarks.cs and numpy_benchmark.py issued a SINGLE scalar lookup
(np.searchsorted(sorted, N/2)): one O(log N) binary search, ~18ns at EVERY N, pure call
overhead. Against NumPy's ~1µs Python overhead that manufactured a meaningless 50–1000x
"win". Fixed: both now query the N-element array (a) into the sorted target → N binary
searches, real work that scales with N. (Verified the C# benchmark project still
compiles.)
BUG 2: normalize_op_name collapsed a slice-copy onto np.copy
The Slicing suite's "np.copy(a[100:1000])" (a fixed 900-element slice copy, ~3.6µs at
every N) was normalized by stripping ALL "[...]", including the array-index
"[100:1000]", yielding "np.copy", which COLLIDED with the Creation full-array
"np.copy(a)" in csharp_index (last-write-wins) and overwrote the real float64
measurement. THAT was the bogus "copy float64 = 0.0036ms" (not a copy bug at all; the op
is fine: archived raw float64 copy@10M = 11.04ms). Fixed: only strip a space-separated
" [annotation]" (\s+\[ instead of \s*\[), never index brackets attached to an
identifier. Incidentally also de-collides concatenate/stack/slice variants.
copy(float64) now reads its real values across all sizes (10M → 11.04ms, ratio 0.60 = a
genuine win).
BUG 3: the report ranked/averaged non-credible rows as wins
merge-results.py sorted "Top Best" by ratio with only a `ratio is not None` guard, so a
sub-resolution NumSharp time (ratio rounding to 0.0) sorted to #1, and CSV blanked legit
0.0 via `r.ratio or ''`. Fixed with a credibility gate (classify()): a row is
"negligible" (new ▫ status) when either side did <1µs of work OR the speedup exceeds 20x
(NumSharp >20x faster ⇒ artifact: a view, a lazy alloc, or a dead-code-eliminated
kernel). Negligible rows are EXCLUDED from Top Best/Worst and from the per-size geomean,
but still listed (▫) in the per-suite tables, so nothing is hidden. Also: store ms at 4
/ ratio at 3 decimals, show 3-decimal ms + 2-decimal ratio in the showcase (no more
"0.0/0.0x"), fix the `or ''` falsy-zero in CSV, add the ▫ legend row +
summary/size-table counts, and a header note stating how many rows were excluded and
why.
Result (regenerated from the on-disk run archive with the fixed merge):
* Top Best is now real reductions/statistics wins (np.nansum 0.08x, np.percentile 0.10x,
np.average 0.10x), with genuine ms on both sides.
* 1233 ops → 305 faster / 255 close / 169 slower / 103 much-slower / 275 NEGLIGIBLE (the
artifacts, previously ~all counted as "faster") / 126 n/a.
* Top Worst surfaces a real gap: np.zeros (NumSharp eagerly zeros ~10.7ms vs NumPy lazy
calloc ~0.01ms), a legitimate optimization target, not an artifact.
benchmark-matrix.md (the DocFX page) re-seeded from the corrected report; docfx build
clean (0 errors). The searchsorted benchmark fix takes effect on the next CI run; the
credibility gate keeps any residual artifact out of the showcase meanwhile.
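The BUG 2 collision can be reproduced with a small sketch. Only the `\s*\[` to `\s+\[` change is taken from the commit message; the rest of the pattern, the function names, and the annotated label are assumptions made for illustration and are not the real normalize_op_name.

```python
import re

# Old behaviour: \s*\[ also matches brackets attached directly to an identifier.
OLD = re.compile(r"\s*\[[^\]]*\]")
# New behaviour: \s+\[ only matches a space-separated " [annotation]".
NEW = re.compile(r"\s+\[[^\]]*\]")

def normalize_old(label):
    return OLD.sub("", label).strip()

def normalize_new(label):
    return NEW.sub("", label).strip()

slice_copy = "np.copy(a[100:1000])"     # Slicing-suite label from the text
annotated = "np.copy(a) [contiguous]"   # hypothetical annotated label

print(normalize_old(slice_copy))   # index brackets stripped: collides with np.copy(a)
print(normalize_new(slice_copy))   # index brackets kept: stays a distinct key
print(normalize_new(annotated))    # the space-separated annotation is still removed
```

Requiring at least one whitespace character before the bracket is what separates an annotation from an index expression in this sketch.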
… 1.3–6.1)
Branch advanced 31 substantive commits past the first changelog (which described through
33058b8). The branch was rebased meanwhile: the original changelog commit bb7ed7a8 is
orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD
is the true new-work boundary.
Learned and folded in:
- np.evaluate: Tier-3C fusion made public; per-node NumPy result_type typing (fixes the
mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions,
EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons,
predicates, bitwise, invert, arctan2): one NumPy-shaped overload each, exact
broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write
corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast
validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure,
finalizer suppression.
- Canonical NDIter benchmark suite + post-release benchmark.yml CI + DocFX
Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce
54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win).
Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447
-> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35.
Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
…oard page
Adds a new DocFX page in the nditer_results.md dashboard style (ASCII bars, geomeans,
win/lose, top wins/losses) applied to the broad op × dtype × N matrix: the
graph/stats/numbers companion to the narrative benchmarks.md, with minimal prose.
* benchmark/scripts/render_dashboard.py reads the merged benchmark-report.json and emits
benchmark-dashboard.md: headline geomean, BY-SIZE-TIER / BY-SUITE / BY-DTYPE bars (same
bar() aesthetic as nditer_sheet.py: length 10 = parity, 20 = 2.0×), the status mix, and
TOP-12 wins/losses with raw ms. Charts only CREDIBLE rows (the merge-results.py gate),
so the negligible artifacts that used to dominate stay out. speedup = NumPy ÷ NumSharp.
* docs/website-src/docs/benchmarks-dashboard.md is the page (title + one-line note + the
```-fenced sheet), seeded from the renderer. Nested under "Benchmarks vs NumPy" in
toc.yml as "Dashboard (op matrix)", beside the full Operation matrix and Iterator sheet.
* benchmark/.gitignore now ignores the benchmark-dashboard.md intermediate (the tracked
form is the DocFX page), matching how benchmark-report.json/csv are handled.
What it shows on the current data (honest, broad picture vs the curated nditer sheet):
0.74× geomean over 832 credible cells (305 win / 527 lose). NumSharp trails on the full
matrix but reaches parity at 10M (0.98×), and wins decisively where its IL kernels
shine: statistics 2.28×, broadcasting 1.22×, reduction 1.21×; uint8 1.07×. Laggards are
arithmetic/unary/creation and bool. Top wins: nansum/percentile/average (8–13×). Top
losses: np.zeros (eager-zero vs NumPy lazy calloc, ~500–880×) and argsort (~25×).
Prototype scope: the page is a committed STATIC snapshot. To make it live (auto-refresh
each release like the matrix/iterator pages), wire render_dashboard.py + a seed step
into run_benchmark.py / benchmark.yml; deferred pending design review. docfx build is
clean.
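The bar scale mentioned above (length 10 = parity, 20 = 2.0×) can be sketched as below. This is a guess at the shape of such a helper, not the real bar() from nditer_sheet.py: the signature, the `cap` parameter, and the block character are assumptions. The three sample ratios are per-suite speedups quoted in these commit messages.

```python
def bar(ratio, unit=10, cap=40):
    """ASCII bar proportional to the ratio: 10 cells = 1.0x, 20 cells = 2.0x.

    `cap` is an assumed safety limit so one extreme ratio cannot blow up the layout.
    """
    cells = max(0, min(cap, round(ratio * unit)))
    return "█" * cells

for name, ratio in [("statistics", 2.28), ("reduction", 1.21), ("creation", 0.35)]:
    print(f"{name:<12} {bar(ratio):<24} {ratio:.2f}x")
```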
Two net8.0-only BCL semantic gaps surfaced by the fuzz differential matrix.
Both behave correctly on net9.0+ (where the BCL was fixed) but produced
wrong values on net8.0; worked around to match NumPy 2.4.2.
1. np.abs(complex) with an infinite component returned NaN instead of +inf
------------------------------------------------------------------------
cabs(NaN + inf*i) must be +inf (C99 hypot / npy_cabs: the infinity test
precedes the NaN test). System.Numerics.Complex.Abs routes through a
private Hypot whose operand ordering is NaN-unaware, so on net8.0 it
returns NaN for abs(NaN+inf*i) (fixed in the .NET 9 BCL).
Added Utilities/NDComplexMath.Abs(Complex): returns +inf when either
component is infinite, else defers to Complex.Abs — so every finite/
NaN-only magnitude that already matched NumPy bit-for-bit is unchanged.
Repointed the two cached MethodInfo handles that drive every complex-abs
emit site: DirectILKernelGenerator.CachedMethods.ComplexAbs (6 IL call
sites across the scalar/strided/predicate/math/decimal unary loops) and
DefaultEngine.UnaryOp.s_complexAbs (NDIter Tier-3B route).
Fixes 19 unary.jsonl + 1 random_smoke.jsonl fuzz cases (all layouts:
contiguous / strided / transposed / broadcast / negstride).
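The infinity-before-NaN rule can be illustrated in a few lines. This is a Python sketch of the idea only, not the C# NDComplexMath.Abs: `naive_hypot` is a stand-in for any NaN-unaware magnitude routine and is an assumption, as is the name `complex_abs`.

```python
import math

def naive_hypot(x, y):
    """NaN-unaware magnitude: any NaN operand poisons the result."""
    return math.sqrt(x * x + y * y)

def complex_abs(z):
    """|z| with the C99 ordering: an infinite component wins, even next to a NaN."""
    if math.isinf(z.real) or math.isinf(z.imag):
        return math.inf
    # Finite and NaN-only inputs fall through to the underlying routine unchanged.
    return naive_hypot(z.real, z.imag)

z = complex(math.nan, math.inf)
print(naive_hypot(z.real, z.imag))  # nan: the result the workaround avoids
print(complex_abs(z))               # inf
```

Because the guard fires only when a component is infinite, every other input takes the same path as before, which is why the finite and NaN-only results stay unchanged.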
2. ptp / amax / amin along an axis dropped NaN instead of propagating it
------------------------------------------------------------------------
The typed-struct leading/innermost axis-reduction fast paths
(MinOp<T>/MaxOp<T>.Combine256/128) called raw Vector256/128.Min/Max. The
x86 vminps/vmaxps these lower to return the SECOND operand on an
unordered (NaN) compare; the BCL Vector{N}.Min/Max only adopted IEEE NaN
propagation in .NET 9. Verified: Vector128.Max(NaN,5) == 5 on net8.0,
== NaN on net10.0. So max/min/ptp over a NaN-laced axis silently lost
the NaN on net8.0 (ptp axis=0 returned a finite value where NumPy = NaN).
Routed MinOp/MaxOp through the existing NaNAwareMinMax256/128 helper
(already used by the contiguous/strided CombineVectors paths) and wrapped
that helper's float/double self-equality mask in #if NET8_0 — so net9.0+
keeps the single-instruction vmaxps with zero overhead while net8.0 gets
ConditionalSelect(ordered, min/max, a+b) NaN propagation. The flat
whole-array reduction kernel already emitted this via
EmitVectorNaNPropagatingMinMax, so only the axis fast paths were affected.
Fixes 12 stat.jsonl fuzz cases (ptp float32/float64, axis 0/1, C/F-contig).
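A scalar model of one vector lane shows why the raw instruction loses the NaN and how the select restores it. This is a Python illustration under stated assumptions: `hw_max` models the "returns the second operand on an unordered compare" behaviour described above, and `nan_aware_max` models the ConditionalSelect(ordered, max, a+b) pattern; neither is the actual C# helper.

```python
import math

def hw_max(a, b):
    """One lane of a raw max: yields the SECOND operand when the compare is unordered."""
    return a if a > b else b

def nan_aware_max(a, b):
    """One lane of ConditionalSelect(ordered, max(a, b), a + b)."""
    ordered = (a == a) and (b == b)   # self-equality is False only for NaN
    return hw_max(a, b) if ordered else a + b

nan = math.nan
print(hw_max(nan, 5.0))         # 5.0: the NaN is silently dropped
print(nan_aware_max(nan, 5.0))  # nan: a + b propagates it
print(nan_aware_max(3.0, 5.0))  # 5.0: ordered inputs are unaffected
```

The `a + b` arm works because any sum involving a NaN is itself NaN, so no extra branch is needed to pick which operand carried it.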
Verification: full unit suite green on BOTH net8.0 and net10.0 (9709 passed
/ 0 failed under the CI filter), FuzzMatrix 42/42 on both. The originally
reported trunc "Could not find Truncate for Vector128" failures were already
resolved in-tree by the CanUseUnarySimd #if NET8_0 guard (commit 5716f86);
the leak-guard working-set tests pass locally (their CI failures were OS
working-set / GC-mode noise, not a managed or unmanaged leak).
…NumSharp faster)
The dashboard prototype was the odd one out: I rendered it speedup = NumPy ÷ NumSharp
(>1× = faster), while the op-matrix report it is derived from — and merge-results.py —
use ratio = NumSharp ÷ NumPy (<1× = faster, lower is better). Two pages off the same data
with opposite conventions is exactly the faster/slower confusion to avoid.
Verified first that the underlying direction is NOT a flip: counting raw milliseconds
(numsharp_ms vs numpy_ms, no ratio involved), NumSharp took LESS time on 305 ops and MORE
time on 526 of 832 credible ops; geomean NS/NP = 1.36. So "NumSharp trails on the broad
matrix" is real (concentrated in Arithmetic = 231 slower ops, and Unary), and it matches the
op-matrix report's own conclusion. The dashboard's data was right; only its convention was
inverted relative to the house default.
render_dashboard.py now uses NS/NP throughout:
* ratio = numsharp_ms / numpy_ms; header + axis read "faster ◄ 1.0 (parity) ► slower".
* HEADLINE 1.36× geomean · 305 faster / 527 slower.
* by-suite / by-dtype ranked fastest→slowest (ascending ratio): statistics 0.44×,
reduction 0.83×, broadcasting 0.82× now read as FASTER; creation 2.83× / unary 2.63× /
bool 3.55× as slower.
* status bands relabelled to NS/NP (faster ≤1.0× / close 1–2× / slower 2–5× / much >5×).
* tables renamed FASTEST / SLOWEST; each row shows the NS/NP ratio plus a human factor
("0.079× (12.6× faster)", "880.9× (881× slower)") so the small-ratio-is-good direction is
unambiguous.
benchmarks-dashboard.md re-seeded with the matching note; docfx build clean. This makes the
report + dashboard consistent. The narrative benchmarks.md, the nditer iterator sheet, and
the README cards still use the speedup (NP/NS, >1× = faster) framing — flipping those is a
separate call (they are win-showcases where >1× reads naturally).
…m the changelog
Per review: the changelog should describe the final state, not the development path.
Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog'
umbrella section entirely and dissolved its content into the proper topical sections,
with all 'wave' terminology and 'added after'/'previously absent'/'now reachable'
path-language gone:
- np.evaluate folded into §2 (NDExpr DSL): per-node result_type typing, fused
reductions, out= rules, EXTERNAL_LOOP guard, measured speedups.
- out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection.
- WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 /
op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1
(capability matrix + bug list); masked-write corruption fix added to §10.
- buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded
into §7; TL;DR memory bullet updated.
- canonical NDIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest
frontier findings folded into §8/§15.
- 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'.
Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content
(minus H1) pushed live to the PR #611 description.
… stat
Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp
faster; bars grow right = faster, the original visual), AND every row now also carries
🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win
reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy
would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP.
render_dashboard.py:
* r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy
time).
* headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g.
statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum
12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP.
* status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower
200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column
narrow (NN% under 1000%, else NN×NP).
benchmarks-dashboard.md re-seeded with the matching note (heredoc, because printf
mis-read %NumPy as a format spec); docfx build clean, emoji verified present (U+1F550
×54).
Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py)
still uses NS/NP "lower is better", and the nditer sheet / cards use NP/NS without the
%NumPy stat; rolling the NP/NS + 🕐 %NumPy convention out to those is the next step,
pending confirmation.
Completes the rollout chosen after the dashboard fix: every benchmark surface now uses
the SAME convention, speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster), and every
surface also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time
NumSharp uses (30% = takes only 30% of the time NumPy would; <100% = faster; huge
slowdowns compact to e.g. 880×NP). So a win reads two intuitive ways at once: "12.66×
faster" and "🕐 8%".
Op-matrix report (merge-results.py): FLIPPED from NS/NP to NP/NS (the one surface that
was "lower is better"):
* ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV).
* get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower
0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20).
* Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain
a 🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×).
* Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst
np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts
unchanged (305/255/169/103/275/126): same rows, just the direction relabelled.
nditer sheet (nditer_sheet.py) + cards (nditer_cards.py): already NP/NS, ADDED 🕐
%NumPy:
* sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's
time"; re-rendered nditer_results.md (--render-only, AV block intact).
* cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer
explains the %. No emoji in matplotlib (DejaVu lacks the glyph), so the % carries it.
Re-rendered.
Narrative benchmarks.md + README: already NP/NS, added the 🕐 %NumPy line to the
convention block, a %NumPy column to the by-class table, and a caption sentence.
DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated
report + sheet; benchmarks.md updated; docfx build clean (0 errors).
The dashboard (render_dashboard.py / benchmarks-dashboard.md) already carries this
convention (49af3af), so the whole benchmark stack (report, dashboard, iterator sheet,
cards, narrative, README) is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.
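The unified convention, the inverted status bands, and the credibility gate fit in a short sketch. The formulas and thresholds are the ones stated in these commit messages; the function bodies and the name `is_negligible` are assumptions and not the actual merge-results.py code. The sample timings are the np.average f32 @10M row from the PR description.

```python
def speedup(numpy_ms, numsharp_ms):
    """NumPy / NumSharp: above 1.0 means NumSharp is faster."""
    return numpy_ms / numsharp_ms

def pct_numpy(numpy_ms, numsharp_ms):
    """Share of NumPy's time that NumSharp uses: below 100 means faster."""
    return numsharp_ms / numpy_ms * 100.0

def get_status(ratio):
    """Bands around >1 = faster: >=1.0 / 0.5-1.0 / 0.2-0.5 / <0.2."""
    if ratio >= 1.0:
        return "faster"
    if ratio >= 0.5:
        return "close"
    if ratio >= 0.2:
        return "slower"
    return "much slower"

def is_negligible(numpy_ms, numsharp_ms, ratio):
    """Credibility gate: under 1 µs (0.001 ms) of work on either side, or ratio > 20."""
    return numpy_ms < 0.001 or numsharp_ms < 0.001 or ratio > 20

numpy_ms, numsharp_ms = 9.60, 0.94
r = speedup(numpy_ms, numsharp_ms)
print(f"{r:.1f}x  {pct_numpy(numpy_ms, numsharp_ms):.0f}%  {get_status(r)}")
```

The two statistics are reciprocals of each other scaled by 100, so a surface that stores one can always derive the other.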
The clock sat before the figure with the right-align padding landing between them
("🕐 87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across
every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding
now sits before the number (where it belongs) instead of after the emoji.
* render_dashboard.py / nditer_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of
NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated.
* merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column,
Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated.
* benchmarks.md + README: convention line / table column / caption "%NumPy🕐".
* DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐".
docfx build clean.
The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu
has no clock glyph — so there was never a gap to fix there).
… form
pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge
slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×", breaking the
NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads
"87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of
NumPy's time like every other cell. The ratio column is untouched: it legitimately uses
× (0.001×, 12.65×); only the %NumPy formatters changed. Report + sheet + dashboard
regenerated, the three DocFX pages re-seeded, docfx build clean.
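The formatter change can be sketched as a before/after pair. Both functions are reconstructions from the behaviour described in these messages (NN% under 1000%, else NN×NP, versus always a percentage); the names `pct_str_old` / `pct_str_new` and the exact rounding are assumptions, and the clock suffix is left to the caller.

```python
def pct_str_old(pct):
    """Earlier behaviour: switch to a multiplier once the percentage reaches 1000."""
    if pct < 1000:
        return f"{pct:.0f}%"
    return f"{pct / 100:.0f}×NP"

def pct_str_new(pct):
    """Current behaviour: always a whole-number percentage, however large."""
    return f"{pct:.0f}%"

for pct in (8.0, 136.0, 87957.0):
    print(f"{pct_str_old(pct):>8}  ->  {pct_str_new(pct):>8}")
```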
…g from the report
The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K)
read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs
87957%, quantile/percentile likewise. 161 rows printed a different ratio at 2dp between
the two committed surfaces.
Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the
FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp.
render_dashboard.py ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded
ms (r["numpy_ms"] / r["numsharp_ms"]), so every row where the 4dp rounding moved a digit
drifted from the report. The report is correct (true ratio of the measured means); the
dashboard was a rounding artifact of its own recompute.
Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON
(the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is
absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype
geomeans key off the same stored ratios the report's Summary-by-size uses.
Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page;
header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%,
quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
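The drift is easy to reproduce with a made-up row. The field names (`numpy_ms`, `numsharp_ms`, `ratio`, `pct_numpy`) and the 4dp/3dp storage precision come from the commit messages; the two mean timings are hypothetical values chosen so the rounding is visible, and the helper names are assumptions.

```python
def ratio_recomputed(row):
    """The drifting path: re-divide the milliseconds that were rounded to 4dp."""
    return row["numpy_ms"] / row["numsharp_ms"]

def ratio_stored(row):
    """The fixed path: use the ratio computed once from the full-precision means."""
    return row["ratio"]

def pct_of(row):
    """Prefer the stored percentage; fall back to 100 / ratio only if it is absent."""
    return row["pct_numpy"] if "pct_numpy" in row else 100.0 / row["ratio"]

# Hypothetical full-precision means, in milliseconds.
numpy_mean, numsharp_mean = 0.01004, 0.00114
row = {
    "numpy_ms": round(numpy_mean, 4),                 # stored at 4 decimals
    "numsharp_ms": round(numsharp_mean, 4),
    "ratio": round(numpy_mean / numsharp_mean, 3),    # stored at 3 decimals
}

print(f"stored     {ratio_stored(row):.2f}x")
print(f"recomputed {ratio_recomputed(row):.2f}x")
```

The smaller the timings, the larger the share of each value that 4dp rounding removes, so sub-millisecond rows drift the most.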
…not run" cells
normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label
and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark
not run" for ops that WERE run. Three archive-safe alias passes (applied identically to
both sides, so they only ever merge a true pair):
* empty "()" — a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
* "->" spacing — C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
* np.around — IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy
emits np.round, so the whole np.round family was ⚪ despite real data
Effect (re-merged from the same archive — no re-run): ⚪ no-data 126 → 116; the np.round
family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape
2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions
(no previously-matched cell lost), 0 new key collisions.
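The three alias passes can be sketched as follows. This is an illustration only: normalize_alias is a hypothetical stand-in, not the real normalize_op_name, and the regexes are assumptions about how such passes might be written.

```python
import re


def normalize_alias(name: str) -> str:
    """Apply the three cosmetic alias passes to one op label."""
    name = name.strip()
    # 1. empty "()": "a.flatten()" meets "a.flatten"
    name = name.replace("()", "")
    # 2. "->" spacing: "reshape 2D -> 1D" meets "reshape 2D->1D"
    name = re.sub(r"\s*->\s*", "->", name)
    # 3. np.around is an alias of np.round
    name = re.sub(r"\bnp\.around\b", "np.round", name)
    return name


# Applied identically to both sides, so the join key is the normalized form.
csharp_labels = ["a.flatten()", "reshape 2D -> 1D", "np.around(a)"]
numpy_labels = ["a.flatten", "reshape 2D->1D", "np.round(a)"]
joined = [
    (c, n)
    for c, n in zip(csharp_labels, numpy_labels)
    if normalize_alias(c) == normalize_alias(n)
]
print(len(joined))
```

Because the same function runs on both sides, a pass can only ever merge labels that already normalize to the same key; it cannot split a pair that previously matched.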
Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells,
0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved
byte-for-byte). The dashboard stays cell-consistent with the report via the canonical
ratio/pct fix from the prior commit.
NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has
no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/
maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean
on uint*/int16 lack a counterpart on one side.
…lex (NumPy parity)
These six complex ufuncs previously threw NotSupportedException from the
EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex
loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires
them up with full NumPy 2.4.2 parity.
Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2
pattern): a bit-exact probe over a finite battery showed System.Numerics.
Complex matches NumPy to a few ULP on the finite interior, but diverges at
86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs
instead of the C99 Annex G values, drops the sign of zero on branch cuts,
and mishandles arctan's imaginary-axis cut. So:
- NDComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite
interior to the BCL and add the C99 fixups:
* Non-finite inputs: special-value tables ported from NumPy's msun
npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities
asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)).
* Branch-cut/signed-zero fixups (empirically derived against NumPy and
verified on a 64-point signed-zero grid): asin negates Re on x=-0 and
Im on y=-0; acos negates Im on the y=+0 cut; atan sets
Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0.
* Where this NumPy build's system libm diverges from msun at infinities
(sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y)
imaginary part, tanh's sign(y) zero, and the genuinely-unspecified
zero signs), the helpers match the observed NumPy 2.4.2 output.
- DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh,
Asin,Acos,Atan} (pointing at NDComplexMath, not Complex.* directly) and
add the six cases to EmitUnaryComplexOperation.
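The two identities quoted for asin/atan can be sanity-checked numerically. A minimal sketch using Python's cmath (not NumPy or NDComplexMath), at a few arbitrarily chosen points away from the branch cuts; the helper names are hypothetical:

```python
import cmath


def i_conj(z: complex) -> complex:
    """i * conj(z), which simply swaps the real and imaginary parts."""
    return complex(z.imag, z.real)


def asin_via_asinh(z: complex) -> complex:
    """asin(z) = i * conj(asinh(i * conj(z)))."""
    return i_conj(cmath.asinh(i_conj(z)))


def atan_via_atanh(z: complex) -> complex:
    """atan(z) = i * conj(atanh(i * conj(z)))."""
    return i_conj(cmath.atanh(i_conj(z)))


points = [0.3 + 0.4j, -1.2 + 0.7j, 2.0 - 3.0j]
asin_ok = all(cmath.isclose(asin_via_asinh(z), cmath.asin(z), rel_tol=1e-12) for z in points)
atan_ok = all(cmath.isclose(atan_via_atanh(z), cmath.atan(z), rel_tol=1e-12) for z in points)
print(asin_ok, atan_ok)
```

The swap form matters for special values: it never multiplies by i, so no inf*0 or sign-of-zero arithmetic is introduced on the way in or out.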
Verification: a bit-exact harness over a 117-point battery (finite + signed
zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy
2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within
<=3 ULP). The only 2 residuals are arctan's finite interior (1e-10 tiny
input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate
than NumPy's log1p-based one; an accepted, documented divergence.
Tests:
- NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch
cuts, signed zeros, and C99 special values.
- Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is
corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held
to a tight 4-ULP gate (a real regression fails) instead of the blanket
complex-unary excuse; arctan stays under the documented blanket for its
accepted BCL-interior divergence.
All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus
cases for the five tightly-gated ops are all within 4 ULP.
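For readers unfamiliar with ULP gating, the comparison can be sketched like this. It is an assumed, simplified classifier (the names and categories are illustrative, not the actual harness), operating on one float64 component at a time:

```python
import math
import struct


def ordered_bits(x: float) -> int:
    """Map a float64 onto an integer line where adjacent floats differ by 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats are sign-magnitude; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between two finite values."""
    return abs(ordered_bits(a) - ordered_bits(b))


def classify(got: float, want: float, max_ulp: int = 3) -> str:
    if struct.pack("<d", got) == struct.pack("<d", want):
        return "bit-exact"
    # Any remaining NaN/inf disagreement is treated as a hard mismatch.
    if not (math.isfinite(got) and math.isfinite(want)):
        return "mismatch"
    if got == 0.0 and want == 0.0:
        return "signed-zero"  # +0 vs -0: equal in value, different bits
    if ulp_distance(got, want) <= max_ulp:
        return "within-ulp"
    return "mismatch"


print(classify(1.0, math.nextafter(1.0, 2.0)))
```

A complex result would be classified per component (real and imaginary separately), which is why the counts above are in components rather than points.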
…e nditer branch Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519 files, +198,407/-16,069 per the GitHub compare. Compiled via a two-pass audit: - Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes; APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc, Direct partial count, NpyIter LOC). - Pass 2: all 279 local commits re-walked against the draft. Caught and fixed: np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git grep origin/master) — reclassified from 'rebuilt' to new, raising the new-API count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item (786d705) and the DefaultEngine->NpyIter Tier-3B production routing. Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master) and are intentionally excluded; the local master ref is stale, which is why master..HEAD misleadingly shows 339 commits. The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH (gh pr edit requires read:org scope the token lacks).
… 1.3–6.1) Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary. Learned and folded in: - np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy. - out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics. - New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive. - nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb). - Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression. - Canonical NpyIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win). Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.
…ndentals Adds AggressiveInlining/AggressiveOptimization to the complex hyperbolic and inverse-trig helpers and restructures them into a hot/cold split, so the JIT folds the per-element math into the IL-emitted unary kernel without a call frame: - Sinh/Cosh/Tanh/Asin/Acos/Atan (+ Abs and the tiny IsNegZero/IsPosZero/ HypotInf/ClogLarge helpers) are marked AggressiveInlining. Each public op is now a tiny finite-path wrapper (finite check -> Complex.* + fixups, or a cold-helper call) so it fits the inliner's budget. - The non-finite C99 special-value tables move into cold helpers (SinhSpecial/CoshSpecial/TanhSpecial/CasinhNonFinite/CacosNonFinite/ CatanhNonFinite) marked AggressiveOptimization -- kept out-of-line (so the hot wrapper stays inlineable) and fully optimized when actually hit. Behavior is identical to the prior inline form (verified below). IL-inlining experiment (the "emit the formula instead of call" question): benchmarked complex sinh both ways over 4M finite elements, median of 15 reps. The real-decomposition formula (Math.Sinh(x)*Math.Cos(y), Math.Cosh(x)*Math. Sin(y)) is bit-identical to Complex.Sinh (0/4M mismatches) but only 1.15x faster than the call; cosh 1.06x; asin/acos/atan have no real-Math.* formula (dominated by complex log/sqrt) so inlining would only drop a wrapper frame. The per-element cost is dominated by the transcendental itself, so emitting ~6 hand-written IL formulas is not worth the duplication/risk -- especially as the call-based kernel is already ~1.56x faster than NumPy 2.4.2 (np.sinh: 26.1 ns/elem vs NumPy 40.9). Decision: keep the handwritten methods; the inlining attributes capture the (small, safe) wrapper-elimination gain. Verified: NewDtypesUnaryTests + Fuzz UnaryExtra (4-ULP complex gate) green (62/62); the hot/cold split changes no results.
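The real-decomposition formula tested in that experiment is shown here in Python rather than emitted IL. This is a sketch: it checks closeness against cmath.sinh at a few arbitrary finite points and makes no claim about bit-identity or timing, which were measured on .NET only.

```python
import cmath
import math


def sinh_real_decomposition(z: complex) -> complex:
    """sinh(x + iy) = sinh(x)*cos(y) + i*cosh(x)*sin(y), using real functions only."""
    x, y = z.real, z.imag
    return complex(math.sinh(x) * math.cos(y), math.cosh(x) * math.sin(y))


samples = [0.3 + 0.4j, -1.5 + 2.0j, 3.0 - 0.25j]
all_close = all(
    cmath.isclose(sinh_real_decomposition(z), cmath.sinh(z), rel_tol=1e-12)
    for z in samples
)
print(all_close)
```

Each element costs four real transcendental calls either way, which is consistent with the small gain reported from removing only the call frame.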
…exponential The np.allclose / np.random.exponential working-set leak guards (np.allclose.UsingTests, np.random.exponential.UsingTests) failed in CI with hundreds of MB of working-set growth (e.g. 551 MB on Linux, threshold 20 MB), while passing on Windows. Root cause: both functions allocate several NDArray intermediates per call and never dispose them — the unmanaged buffers ride the finalizer queue instead of being released synchronously. In a tight loop the managed wrappers are tiny so the GC rarely runs, leaving the intermediates LIVE between collections; the allocator can't reuse that memory, so the high-water mark balloons. On glibc (Linux) freed pages are retained in the arena, so the process RSS stays high even after the test's final GC.Collect()+WaitForPendingFinalizers() — hence the large WorkingSet64 delta. np.isclose: |a-b| <= atol + rtol*|b| materialized ~5 float64 temps (≈400 KB each at 50K elements) plus several bool temps, none disposed. Wrapped every fresh allocation in `using` (the elementwise operators/ufuncs each return a new array). x/y come from astype(copy:false), which returns the input itself when no conversion is needed, so they are caller-owned and never disposed here. The final combined array is captured in `using` too: MakeGeneric<bool>() takes its own refcount on the shared buffer, so disposing the backing temp on return keeps the result alive while keeping that last buffer off the finalizer queue. np.random.exponential: β·(-log(1-U)) left uniform, (1-U) and negate() intermediates un-disposed (only the log result was released). Now disposes all of them; only the trailing `* scale` allocates the fresh array returned to the caller. Effect (measured, peak WorkingSet64 growth across a 1000-iter no-GC loop — the CI failure mode): allclose 551 MB -> 3 MB, exponential -> 1 MB. 
Behavior is unchanged: full suite green on net8.0 and net10.0 (9718 passed / 0 failed under the CI filter), including the Logic fuzz corpus and the isclose/allclose/ exponential unit tests.
…ity with NumPy 2.4.2) The complex (complex128) overloads of the unary math ops deferred to System.Numerics.Complex for their finite interior. The BCL transcendentals diverge from NumPy on a wide range of edge inputs — large magnitudes, the unit circle, tiny/subnormal values, branch cuts and signed zeros — because they do NOT implement the careful FreeBSD msun algorithms NumPy uses. This replaces those deferrals with direct ports of NumPy's own routines in NDComplexMath, verified by a 504-point bit-exact sweep (Python struct-packed int64 references) classifying every result as exact / <=3 ULP / signed-zero / special / sign-flip. Result: 18 of 20 complex unary ops are now at full parity (0 divergence beyond <=3 ULP): exp, log, log10, log2, log1p, expm1, exp2, sin, cos, tan, sqrt, square, reciprocal, negative, sinh, cosh, tanh, arcsin, arctan. Algorithms ported / fixes (src/NumSharp.Core/Utilities/NDComplexMath.cs): - Log (npy_clog): real part = log|z| with the four-regime rescale — |z| huge (x2 down), subnormal (x2^53 up), near the unit circle (0.71<=|z|<=1.73 uses 0.5*log1p((m-1)(m+1)+n^2) via a Goldberg MathLog1p), and 0. Complex.Log cancels the real part to 0 near |z|=1 (e.g. log(1+1e-10 i).real must be 5e-21, not 0). ComplexLog is repointed here, so np.log, np.log2 and np.log10 all inherit the accuracy. - Tanh (npy_ctanh): Kahan's algorithm (t=tan(y); beta=1+t^2; s=sinh(x); rho=sqrt(1+s^2); tanh=(beta*rho*s + i t)/(1+beta*s^2)) plus the |x|>=22 overflow-safe branch. The BCL Complex.Tanh drifts ~33 ULP (tan(1.5) through the tan(z)=-i*tanh(iz) identity). - Sin/Cos/Tan: now route ALWAYS through Sinh/Cosh/Tanh, exactly as NumPy defines npy_csin/ccos/ctan (= -i*sinh(iz) / cosh(iz) / -i*tanh(iz)), so they match NumPy bit-for-bit instead of only on the BCL's finite interior. Fixes sin(0+1e300 i).real = NaN (BCL did cosh(huge)*0); the Sinh/Cosh y==0 guard returns (sinh(x), y)/(cosh(x), x*y) so a large real no longer yields inf*0 = NaN. 
- Expm1 (nc_expm1): real = expm1(x)*cos(y) - 2 sin^2(y/2), imag = exp(x)*sin(y); the real expm1 fallback uses the Goldberg identity (e^x-1)*x/log(e^x) which recovers the ~10 digits exp(x)-1 cancels and avoids underflow (expm1(1e-300)=1e-300, not 0). Fixes the non-finite imaginary (expm1(+Inf+0i).imag = exp(+Inf)*sin(0) = NaN) and origin signed zeros. - Square (z*z with FMA contraction): (fma(re,re,-(im*im)), fma(re,im,im*re)). NumPy's complex multiply is FMA-contracted, so square(1e-10+1e-10 i).real = -2.275e-37 (exact re^2 minus rounded im^2) and square(1e300+1e300 i).real = -inf; Complex.op_Multiply (no FMA) returned 0 and NaN. - Atan (npy_catanh, full): atanh(x) on the real axis, atan(y) on the imaginary axis, and the log1p(4|x|/sumsq(|x|-1,|y|))/4 interior, plus _sum_squares and an exponent-classified _real_part_reciprocal (raw biased-exponent field, NOT Math.ILogB which maps 0/Inf to int.MinValue/MaxValue and overflows the subtraction). Complex.Atan cancelled / underflowed the tiny imaginary part (arctan(0+1e-10 i).imag must be 1e-10). - Exp (npy_cexp): exp(-Inf + I(Inf|NaN)).imag = copysign(0, y) so exp(-inf-inf i).imag = -0 (the system libm keeps sign(y); npy_cexp's flat (0,0) dropped it). exp2 inherits this. - Reciprocal already used Smith's nc_recip (overflow-safe, correct signed zeros). Engine wiring (DirectILKernelGenerator[.Unary.Decimal].cs): ComplexLog repointed to NDComplexMath.Log; new cached methods ComplexExpm1 and ComplexSquare; the Expm1 and Square cases in EmitUnaryComplexOperation now call the ported helpers instead of inline Complex.Exp(z)-1 / Complex.op_Multiply. Accepted residuals (pathological inputs only, documented in code + the fuzz registry): - cos/sin with a NaN imaginary part: the resulting zero's sign is C99-UNSPECIFIED; the platform libm and the npy_ccos identity pick opposite signs (2 cases). 
- arccos with a sub-DBL_MIN imaginary part: Complex.Acos flushes the denormal real part to 0 where cacos's _do_hard_work keeps it (~5.8e-309); a denormal-range edge (4 cases). - sinh/cosh at the overflow boundary |x| in [710, 710.13]: Windows' CRT sinh overflows to inf while .NET Math.Sinh stays finite (a platform-libm boundary, absent on glibc). Tests: NewDtypesUnaryTests.cs adds 11 NumPy-2.4.2-verified cases for the huge-imaginary sin/cos, large-real sinh/cosh overflow, Kahan tan accuracy, near-unit-circle clog, scaled log10/log2/log1p, Goldberg expm1, exp2, FMA square, reciprocal signed zeros, catanh tiny/large arctan, and the exp -inf signed zero. Fuzz/MisalignedRegistry.cs tightens the complex-unary gate to <=3 ULP across the whole set (was a 4-ULP gate on 5 ops + a blanket excuse for the rest), narrows the >3-ULP excuse to the named pathological ops, and adds a separate entry for the (pre-existing) complex reduction/scan NaN-ordering divergence the old blanket covered. Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9729 passed, 0 failed. net8.0 + net10.0 both build clean.
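Kahan's tanh formula from the list above, as a Python sketch. The interior branch follows the formula as quoted; the form of the |x| >= 22 branch (sign(x) + i*4*sin(y)*cos(y)*exp(-2|x|)) is an assumption derived from the large-|x| asymptote and is not copied from NDComplexMath. Non-finite inputs are not handled.

```python
import cmath
import math


def tanh_kahan(z: complex) -> complex:
    x, y = z.real, z.imag
    if abs(x) >= 22.0:
        # Overflow-safe branch (assumed form): sinh(x)^2 would lose the tiny imaginary part.
        exp_mx = math.exp(-abs(x))
        return complex(math.copysign(1.0, x), 4.0 * math.sin(y) * math.cos(y) * exp_mx * exp_mx)
    t = math.tan(y)
    beta = 1.0 + t * t          # 1 / cos(y)^2
    s = math.sinh(x)
    rho = math.sqrt(1.0 + s * s)  # cosh(x)
    denom = 1.0 + beta * s * s
    return complex(beta * rho * s / denom, t / denom)


def components_close(a: complex, b: complex, rel: float = 1e-9) -> bool:
    return math.isclose(a.real, b.real, rel_tol=rel) and math.isclose(a.imag, b.imag, rel_tol=rel)


samples = [0.5 + 1.5j, -2.0 + 0.3j, 1e-3 + 1e-3j, 25.0 + 1.0j]
kahan_ok = all(components_close(tanh_kahan(z), cmath.tanh(z)) for z in samples)
print(kahan_ok)
```

The comparison is per component on purpose: at x = 25 the imaginary part is around 1e-22, and a complex-magnitude tolerance would hide any error in it.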
…-run
Adds the benchmark definitions that were missing on one side of the op-matrix join (so the
ops showed ⚪ "not run" or were discarded as C#-only), then re-runs the whole official suite
(all 14 comparison suites x 3 cache tiers, ~3h) to fill them in with live numbers. Result:
⚪ no-data 130 -> 76, and the headline moves from a stale 0.74x to 1.08x geomean (93%🕐)
over 1386 credible cells — the stale figure was dragged down by a broken searchsorted and by
simply missing most of NumSharp's fast reductions.
NumPy side (numpy_benchmark.py) — C# already benchmarked these; NumPy didn't:
* unary: np.tan, np.exp2, np.expm1, np.log2, np.log1p, np.clip(a,-10,10),
np.power(a,2|3|0.5)
* reduction: np.cumsum (all arithmetic dtypes), np.prod + np.prod axis=0/1, and the axis
variants np.amax/np.amin/np.mean axis=0(/1) and np.var/np.std axis=0
All names normalize to the existing C# [Benchmark(Description=...)] so they join 1:1.
C# side:
* ProdBenchmarks: was non-standard sizes (100/1000/10000) + method-form names (a.prod());
nothing could join it. Switched to the standard Small/Medium/Large tiers and function-form
np.prod(a)/np.prod(a, axis=k) — values stay in [0.5,1.0] so the product is overflow-safe at
every size. prod now has full + axis coverage (18 cells).
* MeanBenchmarks: CommonTypes -> ArithmeticTypes, closing the np.mean uint*/int16 ⚪ holes
(15 cells) — matches SumBenchmarks/MinMaxBenchmarks.
* LogicBenchmarks: isnan/isinf/isfinite/maximum/minimum/array_equal now join (54 cells).
Verified on the fresh run: searchsorted is purged of the 0.0000ms / >1e6x rows (now real,
1.16-1.44x faster), prod/cumsum/all axis reductions/the 6 predicates/mean-on-uint* all matched.
Regenerated benchmark-report.{md,json,csv} + dashboard and re-seeded the matrix + dashboard
DocFX pages.
KNOWN BUG surfaced (left as ⚪): np.isclose and np.allclose DETERMINISTICALLY segfault NumSharp
with the unmanaged-storage AccessViolation — each crashes even run alone, and in-class it killed
the whole logic suite before BenchmarkDotNet could export anything (took the 6 working predicates
down with it). Disabled both in LogicBenchmarks with a documented note; re-enable once the
NumSharp isclose/allclose lifetime bug is fixed. The 6 predicates were recovered by running each
in its own process (the same per-section isolation the NDIter harness uses for its AV).
…segfault) + doc review Parity review of the complex unary math overloads (commit 416affc). Verified all 20 affected ops across memory layouts (contiguous / F-contiguous / strided / transposed / both negative-stride directions / sliced-offset / broadcast / 0-d / empty) with a fresh bit-exact sweep — every op is layout-correct with 0 divergence — and confirmed the out=/where= ufunc parameters compose bit-exactly with the new complex kernels (exp returns the same out instance; sqrt's where=mask preserves masked-off slots). The review surfaced a pre-existing MEMORY-SAFETY bug (segfault), now fixed: np.exp(complex_array, dtype=float64) # and sqrt/log/log2/log10/log1p/expm1/exp2/sin/cos/tan/ # sinh/cosh/tanh/arcsin/arccos/arctan segfaulted instead of raising. Root cause: ResolveUnaryFloatReturnType honored an explicit dtype= override after only rejecting integer/bool targets (over < Single -> "No loop matching"). It never checked that the INPUT can reach the requested loop dtype by a same_kind cast. For a complex input + real-float dtype=, it returned the real type, ExecuteUnaryOp allocated an 8-byte/element output buffer, and the 16-byte/element complex kernel overran it. NumPy 2.4.2 raises instead: "Cannot cast ufunc 'exp' input from dtype('complex128') to dtype('float64') with casting rule 'same_kind'" Fix: ResolveUnaryFloatReturnType now calls the existing ValidateUnaryInputCast (already used by square/reciprocal/negative, which were NOT affected) on the override path. This reuses NDIterCasting.CanCast(SAME_KIND), so it allows the legal narrowings (int->float32, float64-> float32, float->complex) unchanged and rejects only the cross-kind complex->real cast, emitting NumPy's verbatim message. Probe matrix (complex/float/int inputs x float/complex/int dtype=) now matches NumPy across all 17 float-producing complex ufuncs; the order is preserved (integer dtype= still raises "No loop matching" before the cast check). 
Also refreshes the NDComplexMath class doc comment, which still described the old fork state ("sinh/cosh/tanh/asin/acos/atan delegate straight to System.Numerics.Complex", "arctan's BCL interior is the lone documented divergence") — it now lists the actual ported algorithms (npy_clog, Kahan ctanh, csinh/ccosh, npy_catanh, npy_cexp/csqrt, nc_expm1/Goldberg, FMA square, nc_recip), the two ops still delegating (asin/acos at parity), and the three accepted pathological residuals. Tests: NewDtypesUnaryTests.cs adds Complex_FloatUfunc_NarrowingDtype_RaisesCastError_NotSegfault (exp/log/sqrt/sin/tanh/arctan: complex+dtype=float64 raises the verbatim cast error, complex+ dtype=int64 raises "No loop matching", complex+dtype=complex128 returns complex). Full CI-style suite (net10.0, exclude OpenBugs/HighMemory): 9730 passed, 0 failed. net8.0 + net10.0 build clean. Note: ceil/floor/round/trunc on complex reject cleanly (no segfault) but with NumSharp's own message rather than NumPy's "ufunc not supported for the input types" — left as-is (out of scope; NumPy has no complex loop for them either). The int->exp2 InvalidProgramException (Single-output kernel) remains a separate, already-tracked bug (fuzz registry W3-C), unrelated to complex.
…o (match/beat NumPy 2.4.2) np.zeros was ~1000x slower than NumPy for large arrays (10M float64: 14.3 ms vs NumPy 0.011 ms). Root cause: it allocated an uninitialized buffer and then ran an eager per-element Fill loop that touched (and zeroed) every byte. NumPy instead delegates zeroing to the OS: PyDataMem_NEW_ZEROED -> calloc, whose demand-zero pages are committed and zeroed lazily on first write, so allocating zeros is effectively O(1) regardless of size (numpy/_core/src/multiarray/alloc.c npy_alloc_cache_zero: small sizes use a cache+memset, large sizes calloc). This ports NumPy's structure. The zeroing is now done by the allocator/OS, never an element loop — correct for all 15 dtypes because the all-zero bit pattern equals default(T) for every one of them (incl. Half, Single, Double, Decimal, Complex). Implementation -------------- - ArraySlice.Allocate(..., fillDefault: true) and Allocate<T>(..., true) now route to UnmanagedMemoryBlock<T>.AllocateZeroed instead of `new UnmanagedMemoryBlock<T>(count, default)` (Take + scalar Fill). All np.zeros overloads, np.zeros_like, np.eye/np.identity, and every internal fill-with-default allocation flow through here. - SizeBucketedBufferPool.TakeZeroed: NativeMemory.AllocZeroed (calloc) with no dirty-bucket reuse — a recycled buffer would force a full memset, discarding the lazy demand-zero win for large sizes and being no cheaper than calloc for small ones. - OsVirtualMemory (new, Windows-only): the Windows process heap eager-commits and memsets mid-size calloc requests (~256 KiB-2 MiB, ~0.05 ms for 800 KiB), unlike glibc/macOS which mmap large blocks lazily. For >= 128 KiB on Windows AllocateZeroed uses VirtualAlloc(MEM_COMMIT) (copy-on-write zero pages, ~0.002 ms) and a new Disposer AllocationType.Virtual that releases straight to the OS via VirtualFree (not pooled). Non-Windows and small sizes stay on calloc, which is already lazy/cheap there. 
Benchmark fix (pre-existing bug)
--------------------------------
CreationBenchmarks returned each created array without disposing, leaking one buffer per
op. NumPy's harness (numpy_benchmark.py) discards each result inside the timed loop, so
CPython refcount frees it immediately -- i.e. NumPy measures alloc+free while the C#
benchmark measured alloc-only (unfair) and leaked. Under BenchmarkDotNet's
thousands-of-ops-per-iteration, every untouched-but-committed buffer still charges Windows
commit, so any fast creation op OOMs at 10M (np.empty(10M) already did; the old np.zeros
only escaped by being slow enough to throttle BDN to a couple ops/iteration). All creation
benchmarks now dispose per op, matching NumPy and bounding resident memory.
Results (this machine, vs NumPy 2.4.2; BDN alloc+free)
------------------------------------------------------
- 10M float64: 14.3 ms -> 0.0033 ms (was ~1000x slower; now 3.1x faster)
- medium (100K): 1.7-3.8x faster across i32/i64/f32/f64
- large (10M): 1.1-3.5x faster across i32/i64/f32/f64
- small (1K): ~1.5-2x slower -- bounded by NDArray object construction
  (NDArray/Storage/Shape/ArraySlice/Disposer), shared by all creation APIs and
  sub-microsecond; the allocation itself is optimal.
Tests
-----
New Creation/np.zeros.AllocationTests.cs (12 tests): all 15 dtypes zeroed across
heap/VirtualAlloc size regimes, full-scan of a multi-MB array, VirtualAlloc
writeability/commit correctness, OwnsData, non-aliasing, reuse-after-dispose,
multi-dim/high-rank/empty/sliced, default dtype, all overloads.
Full CI suite (net8.0 + net10.0, excluding OpenBugs/HighMemory) green: 0 failed, 9742 passed.
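The "let the OS zero it" idea can be illustrated from Python with an anonymous memory mapping, which the OS hands back zero-filled. This is an analogy only (Python's mmap module, not NumSharp's allocator or NumPy's calloc path); the helper name is hypothetical.

```python
import mmap
import struct


def allocate_zeroed(nbytes: int) -> mmap.mmap:
    """Anonymous mapping: pages are supplied zero-filled by the OS, no fill loop."""
    return mmap.mmap(-1, nbytes)


count = 1_000_000
buf = allocate_zeroed(8 * count)

# The all-zero bit pattern decodes to the default value for each element type.
first_f64 = struct.unpack("<d", buf[0:8])[0]
last_f64 = struct.unpack("<d", buf[8 * (count - 1):8 * count])[0]
first_i64 = struct.unpack("<q", buf[0:8])[0]
first_f32 = struct.unpack("<f", buf[0:4])[0]
print(first_f64, last_f64, first_i64, first_f32)

# Release straight back to the OS, mirroring the alloc+free pairing the benchmark now measures.
buf.close()
```

Only the pages actually touched (here the first and the last) need to be backed by physical memory, which is why the cost does not scale with the requested size.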
…o longer crashes)
nd[(nd < 3)] = -2 -- assigning a scalar into a boolean-mask-selected subspace -- used to
trip a Debug.Assert and kill the test host (the test pre-threw to dodge it). The
broadcast-value assignment path (SetIndicesND scalar/broadcast handling) fixed it; the
whole flow now matches NumPy:
nd = [[1,2,3],[4,5,6]]; nd[nd < 3] = -2 -> [-2,-2,3,4,5,6]
nd[(nd == -2) | (nd > 5)] -> [-2,-2,6]
Removed the pre-throw guard and the [OpenBugs] attribute.
Tests: NDArray.Indexing.Test class 123 passed / 0 failed (net8.0).
…he int[] overload
The differential index oracle (NumPy vs NumSharp, 2265 getter/setter cases x layouts +
104 dtype cases) surfaced 12 setter divergences, all via the object[] single-int setter
b[(object)0] = v and the long[] coordinate shim b.SetData(v, 0L):
b = arange(12).reshape(3,4)
b[(object)0] = -1 -> [-1,1,2,3,...] NumPy: [-1,-1,-1,-1,...] (fills row 0)
b[(object)0] = [1,2] -> partial write NumPy: ValueError ((2,) into (4,))
Root cause: SetData(NDArray, params long[]) carried the OLD logic the int[] overload had
before the broadcast fix -- for a scalar it wrote only the FIRST element of the sub-array
(not a fill), and for a larger/smaller value it linear-copied with no broadcast
validation. (b[0] = v via a literal int goes through the Slice path and was always
correct; only the object[]/long[] entry points were wrong.)
Fixed by delegating SetData(NDArray, long[]) to the corrected SetData(NDArray, int[]) so
scalar-broadcast-across-subarray, value broadcasting/tiling, and the NumPy shape-mismatch
ValueError all apply uniformly. After the fix the differential sweep is 2265/2265 +
104/104 = 0 divergences (1-to-1 parity: success/failure agreement, exact shape, bit-exact
gathered values). Added regression ObjectArraySingleInt_Setter_BroadcastsAndValidates.
Tests: full CI suite 10977 passed / 0 failed / 11 skipped (net8.0).
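The expected setter semantics (scalar fills the row, a mismatched length raises) can be sketched on nested lists. This is an assumed toy model for a 2-D case only: set_row is a hypothetical helper, not SetData, and the error text is modelled on NumPy's broadcast error.

```python
def set_row(matrix: list[list[int]], row: int, value) -> None:
    """Assign `value` into matrix[row] with scalar/length-1 broadcast and shape validation."""
    width = len(matrix[row])
    if not isinstance(value, (list, tuple)):
        matrix[row] = [value] * width          # scalar: fill the whole sub-array
    elif len(value) == 1:
        matrix[row] = [value[0]] * width       # length-1: broadcasts like a scalar
    elif len(value) == width:
        matrix[row] = list(value)              # exact shape: element-wise copy
    else:
        raise ValueError(
            f"could not broadcast input array from shape ({len(value)},) into shape ({width},)"
        )


b = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
set_row(b, 0, -1)
print(b[0])
```

The old long[] path corresponds to writing only matrix[row][0] in the scalar case and copying min(len(value), width) elements otherwise, which is exactly the pair of divergences listed above.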
…y-safety)
The differential index oracle (random-fuzz layer over exotic mixed advanced-index
combinations) found that over-indexing a rank-N array with more than N advanced index
arrays walked strides past the end of the shape: the subshape was sized
`srcShape.NDim - ndsCount` (negative) and the offset/getter loops dereferenced strides[i]
beyond the array -> OOB read/write (heap corruption / OverflowException).
Added a guard in FetchIndices<T> and SetIndices<T>: ndsCount > source.ndim now raises
IndexError "too many indices for array: array is N-dimensional, but M were indexed"
(NumPy parity) before any unsafe stride math runs.
Tests: full CI suite 10978 passed / 0 failed / 11 skipped (net8.0).
Note: the differential sweep still surfaces deeper mixed advanced-index divergences
(multi-dim fancy + slice + 0-d bool + newaxis/empty combinations) and a separate flaky OOB
in that path -- tracked separately; the curated common-surface sweep (2369 cases x layouts
x dtypes) remains 0 divergences.
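The guard amounts to checking the index count before the subshape rank is computed. A minimal sketch with a hypothetical function name; the message text is the one quoted above:

```python
def remaining_rank(ndim: int, index_count: int) -> int:
    """Validate the number of advanced indices, then return the subshape rank."""
    if index_count > ndim:
        raise IndexError(
            f"too many indices for array: array is {ndim}-dimensional, "
            f"but {index_count} were indexed"
        )
    # Safe now: this can no longer go negative.
    return ndim - index_count


print(remaining_rank(3, 2))
```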
A differential index oracle (NumPy 2.4.2 vs NumSharp, curated 2369 cases + a seeded random-fuzz layer) proved the curated common surface is bit-exact (0 divergences) but ~660-700 divergences remain across EXOTIC mixed advanced-index combinations (bool-array+fancy, multi-dim fancy+slice, 0-d-bool+fancy, multi-fancy, empty combos) plus a flaky heap-corruption crash in that path. Handover documents how to close it by porting NumPy mapping.c's unified two-stage model (prepare_index + MapIterNew/_get_transpose) to REPLACE the current per-shape Try* fast-path patchwork, which cannot generalise. Covers: - precise divergence categories (by failure mode and by index-form feature) - why the patchwork architecture cannot reach parity - the NumPy algorithm with file:line citations (mapping.c) - a phased plan: lock the gate -> hunt the OOB crash -> PrepareIndex -> unified MapIter gather/scatter -> edges/overlap - keep-vs-replace map of every existing indexing helper - the differential harness (token encoding, base recipes, run/regenerate) and how to promote it into the committed test/oracle + [FuzzMatrix] gate - memory-safety crash hunt (page-heap/GCStress to catch the delayed OOB at the write) - DOD, risks, first-day checklist Successor to advanced-index-axis-placement.md (which resolved the two-advanced+slice sub-case via TryBuildMultiAdvancedGrid).
… (Phase A)
Promotes the scratchpad getter/setter differential harness into the committed
oracle pipeline, per docs/plans/advanced-index-combinatorial-handover.md Phase A.
This is the gate the full mapping.c port (Phases C-E) must drive to 0/0; until it
is committed it cannot defend the fix.
What lands:
- test/oracle/gen_index_oracle.py — NumPy 2.4.2 oracle. Emits a portable TOKEN corpus
(index fields encoded as [int n]/[slice ...]/[new]/[ell]/[arr flat shape]/
[barr ...]/[b0 bool]/[a0 n]; values [scalar n]|[arr ...]) across 15 base recipes
(S,V0,V1,V6,A,AT,ARS,ACS,ANR,ANC,ASO,ABC,B,BT,E03) for get+set, a 13-dtype sweep,
and a seeded random-fuzz layer. Writes JSONL into Fuzz/corpus/ (csproj glob copies
it to test output — no Python at test time, matching the existing FuzzMatrix gates).
- Fuzz/IndexOracleTests.cs — [FuzzMatrix] replay. Rebuilds the SAME base+index from
tokens, runs get/set, bit-compares shape + int64 values + which-side-raised.
- Three corpora:
index_curated.jsonl (2265) — deterministic matrix, CI gate
index_dtype.jsonl (104) — forms x 13 dtypes, CI gate
index_random_20240626.jsonl (10000) — seeded fuzz, the target
Gate status (reproduced, both frameworks):
- Index_Curated + Index_Dtype: 0 divergences (green, run in CI as FuzzMatrix).
- Index_Random: ~697 divergences (209 throws-on-valid, 404 accepts-invalid,
84 shape/value) + a flaky heap-corruption AccessViolation in the mixed-advanced
path. Marked [OpenBugs] so CI excludes it (avoids the crash) until Phases C-E land;
un-marked at Phase E per the handover DOD.
The curated/dtype gate pins the 13 indexing fixes already on nditer (b03e40b7..998c1d23)
so they cannot regress while the combinatorial port proceeds.
…pt-in page-heap (Phase B)
Memory-safety hardening for the advanced-index path, per
docs/plans/advanced-index-combinatorial-handover.md Phase B.
Block-copy bounds guards (permanent fix)
----------------------------------------
The fancy gather/scatter copy one subShapeSize block per selected offset but the upstream
bound check only validated each block's START offset, not its full extent. A miscomputed
retShape/subShape (the exotic mixed-advanced combos the per-shape Try* dispatch mishandles)
therefore copied past the end of a pinned/native buffer -> silent heap corruption (the flaky
AccessViolation the differential sweep surfaced). Each raw block-copy / odometer / value-read
site now validates the WHOLE span against the real buffer capacity (Shape.BufferSize) and
throws a tagged IndexOutOfRangeException instead of corrupting:
- FetchIndicesND (getter contiguous block gather)
- FetchIndicesNDNonLinear (getter strided odometer gather)
- SetIndicesND (+ non-linear) (setter block scatter)
- SetIndices non-subshaped value read (value shorter than the selection)
Shared IndexingOobMessage() names the offending copy + computed retShape/subShape so a
divergence is traced to the mishandled index combination.
Opt-in page-heap (diagnostic infra, zero production impact)
-----------------------------------------------------------
SizeBucketedBufferPool gains a NUMSHARP_GUARD_PAGES=1 mode (Windows only, read once at
startup): every pool Take hands back a buffer whose last byte abuts an inaccessible
PAGE_NOACCESS guard page (OsVirtualMemory.AllocGuarded/FreeGuarded), bypassing reuse, so a
one-past-the-end write into a POOL buffer faults instantly at the offending access.
Default OFF: Take/TakeZeroed/Return and the np.zeros VirtualAlloc bypass keep their exact
production paths.
Findings (recorded for Phase C)
-------------------------------
Every corruptor case the sweep names is an index combination NumPy REJECTS that NumSharp's
Try* stack wrongly accepts and feeds malformed shapes to a kernel:
  V6[arr([3,1]), 2, arr([])]                    -> over-indexed 1-D (NumPy IndexError)
  A[barr([F],(1,)), None, barr([F,F,F,F],(4,))] -> bool length != axis (NumPy IndexError)
The guard pages did not fault on these in isolation because the overrun target is a
FromArray-pinned MANAGED index array (not a pool buffer), confirming the real fix is
up-front validation: Phase C's prepare_index rejects these before any kernel runs, which
structurally eliminates the whole OOB class. The block-copy guards above remain as a
defense-in-depth backstop.
No regression: 1299 indexing/selection tests + the Index_Curated/Index_Dtype gate green.
…ate (Phase C)
Implements docs/plans/advanced-index-combinatorial-handover.md Phase C: a faithful port
of NumPy 2.4.2's prepare_index (numpy/_core/src/multiarray/mapping.c:262 prepare_index_noarray)
that classifies and VALIDATES the whole index tuple in one pass before any per-shape Try*
fast path runs. This replaces the scattered, per-shape validation that let the heuristic
stack accept combinations NumPy rejects and feed malformed shapes to a kernel.
New file Selection/NDArray.Indexing.PrepareIndex.cs:
- IndexType (NumPy HAS_* bitmask), IndexKind, IndexOp, PreparedIndex.
- PrepareIndex(Shape, object[]): the classification cascade (ellipsis / newaxis / slice /
integer / 0-d-bool / k-d-bool->nonzero / integer-array / 0-d-array-scalar / invalid),
the ellipsis fill + HAS_SCALAR_ARRAY cleanup + a[()] special case, then the post-walk
validation NumPy does once axis placement is known:
* too-many-indices -> 'array is N-dimensional, but M were indexed' (mapping.c:665)
* boolean array dim -> 'boolean index did not match indexed array along axis A...' (:709)
* integer/array VALUE bounds -> 'index N is out of bounds for axis A with size S'
* advanced block broadcast -> 'shape mismatch: indexing arrays could not be broadcast
together with shapes ...' (:2617) [bit-exact message, NumPy-verified]
* single ellipsis / non-integer-or-boolean array -> the verbatim IndexErrors.
- Wired as a gate at the top of FetchIndices/SetIndices(object[]) for every multi-index
tuple (indicesLen != 1); valid tuples pass straight through to the existing dispatch.
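The "validate the whole tuple before any kernel runs" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `check_index_structure` is a hypothetical name, it handles just int / slice / newaxis / ellipsis items (no arrays or booleans), and it is not the PrepareIndex implementation. The two IndexError texts follow NumPy's wording.

```python
def check_index_structure(shape, index):
    """Validate a basic index tuple against `shape` before touching any data.

    Hypothetical sketch: supports int, slice, None (newaxis) and Ellipsis only.
    Returns the number of source axes consumed by the explicit items.
    """
    ndim = len(shape)
    if sum(item is Ellipsis for item in index) > 1:
        raise IndexError("an index can only have a single ellipsis ('...')")
    for item in index:
        supported = item is Ellipsis or item is None or isinstance(item, (int, slice))
        if isinstance(item, bool) or not supported:
            raise TypeError(f"unsupported index item in this sketch: {item!r}")

    # ints and slices consume one source axis each; newaxis and ellipsis consume none
    consumed = sum(isinstance(item, (int, slice)) for item in index)
    if consumed > ndim:
        raise IndexError(
            f"too many indices for array: array is {ndim}-dimensional, "
            f"but {consumed} were indexed")

    # post-walk: once the ellipsis fill is known, every int can be bounds-checked
    axis = 0
    for item in index:
        if item is Ellipsis:
            axis += ndim - consumed
        elif isinstance(item, slice):
            axis += 1
        elif isinstance(item, int):
            size = shape[axis]
            if not -size <= item < size:
                raise IndexError(
                    f"index {item} is out of bounds for axis {axis} with size {size}")
            axis += 1
    return consumed
```

Because every rejection happens in this single pass, a malformed tuple never reaches a gather or scatter routine, which is the property the gate relies on.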
Impact (differential random sweep, seed 20240626):
- ns-accepted-invalid 404 -> ~7 for the index-structure classes (over-index, bool-length
mismatch, OOB index value, un-broadcastable advanced) — the combinations that also drove
the mixed-advanced heap corruption now raise BEFORE any kernel, removing that OOB source.
- Total divergences ~697 -> ~440 (windowed; the residual ~89 accepts-invalid are SETTER
value-broadcast cases -> Phase E, and the shape/value + rejects-valid buckets -> Phase D).
- A residual flaky wrong-shape overrun into a pinned managed index array survives in the
random sweep only (the [OpenBugs] gate, CI-excluded); it is one of the wrong-shape
divergences Phase D's exact axis placement eliminates (handover: correct shapes => no OOB).
Tests: full net8.0 suite 10980 passed / 0 failed; net10.0 indexing + Index_Curated/Index_Dtype
gate green. IndexNDArray_Case10_Multi now expects IndexError (NumPy-correct; was the non-NumPy
IncorrectShapeException) and GetIndicesFromSlice's reflection proxy matches by name (PrepareIndex
also takes Shape as its first parameter).
Slice.ToSliceDef's negative-step branch clamped a start more negative than -dim to 0,
yielding a spurious length-1 slice that began at index 0; NumPy clamps it to -1 ('before
the beginning' when walking backwards), making the slice empty.
arange(3)[-7::-2] NumPy [] was NumSharp [0]
arange(3)[-4::-2] NumPy [] was NumSharp [0]
arange(2)[-7:-3:-2] NumPy [] was NumSharp [0]
In-range negative starts (e.g. [-2::-1] == [1,0], [-1::-1] == [2,1,0]) and the positive-step
branch are unchanged — only an out-of-lower-bound negative start with a negative step is
affected. Surfaced by the differential index sweep (13 pure-basic-slice divergences in the
first 5000 random cases, now 0).
Tests: 2028 indexing/slice/view + Index_Curated/Index_Dtype gate green; no regression.
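The clamping rule above is the same one Python's built-in `slice.indices` applies, so it can be checked without NumPy. The sketch below is an assumption-labelled illustration (`adjust_bound` and `slice_def` are hypothetical names, not the NumSharp Slice.ToSliceDef code): a negative-step bound that is still negative after adding the axis length clamps to -1, not 0.

```python
def adjust_bound(value, dim, step):
    """Clamp one explicit slice bound for an axis of length `dim`."""
    if value < 0:
        value += dim
        if value < 0:
            # walking backwards, "before the beginning" is -1, not 0
            value = -1 if step < 0 else 0
    elif value >= dim:
        value = dim - 1 if step < 0 else dim
    return value


def slice_def(start, stop, step, dim):
    """Return (start, stop, step, length) for start:stop:step on a length-`dim` axis."""
    step = 1 if step is None else step
    if step == 0:
        raise ValueError("slice step cannot be zero")
    if start is None:
        start = dim - 1 if step < 0 else 0
    else:
        start = adjust_bound(start, dim, step)
    if stop is None:
        stop = -1 if step < 0 else dim
    else:
        stop = adjust_bound(stop, dim, step)
    if step < 0:
        length = (start - stop - 1) // (-step) + 1 if stop < start else 0
    else:
        length = (stop - start - 1) // step + 1 if start < stop else 0
    return start, stop, step, length


# The three cases from the commit message all come out empty:
print(slice_def(-7, None, -2, 3))   # (-1, -1, -2, 0)
print(slice_def(-4, None, -2, 3))   # (-1, -1, -2, 0)
print(slice_def(-7, -3, -2, 2))     # (-1, -1, -2, 0)
```

Clamping the start to 0 instead of -1 is what produced the spurious length-1 slice: with start 0 and stop -1 the length formula yields 1.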
…eg-stride offset bound (Phase D)
Two coupled fixes that make TryBuildMultiAdvancedGrid the single advanced-index gather for
ALL HAS_FANCY tuples (mapping.c MapIterNew axis placement), replacing the np.take fast path:
1. largestReachableOffset (neg-stride bound). FetchIndices/SetIndices validated gather
   offsets against GetOffset(size-1 corner), which for a NEGATIVE-stride view is the MINIMUM
   corner, not the maximum, so valid early-row offsets on a[::-1]/a[:,::-1] were rejected as
   out of bounds (IndexOutOfRangeException). Now bounded by the true max reachable offset
   (base + per-axis positive-stride contribution; == size-1 when contiguous, unchanged for
   positive strides).
2. Grid handles a SINGLE advanced index. TryBuildMultiAdvancedGrid required >=2 advanced
   axes, so a single MULTI-DIM fancy array mixed with a slice (a[arr(2,2), 1::2]) fell
   through to the non-general broadcast path and dropped the slice's output axis
   (NumPy (2,2,1) -> NumSharp (2,2)). Lowered to >=1; with the offset fix the grid now also
   subsumes the 1-D fancy + slice/newaxis cases the np.take path mishandled (newaxis axis
   arithmetic, negative slices, non-contiguous sources all threw
   ArgumentOutOfRange/IndexOutOfRange on valid input).
The getter no longer calls TryFetchSliceWithSingleAdvanced (now unreferenced; the whole Try*
stack is removed in the final Phase D cleanup once the setter is migrated too).
Impact (random sweep [0,5000)): divergences 123 -> 74; shapeDiff 41 -> 17, the np.take
ArgumentOutOfRange bucket (20) eliminated. The remaining threw-on-valid are empty advanced
indices (arr([])/barr([])) and 0-d-bool combos (Phase E).
Tests: net8.0 1681 indexing/selection/slice + net10.0 1216 indexing, Index_Curated/Index_Dtype
gate green on both.
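A small worked example may help with the negative-stride bound. The helpers below are hypothetical stand-ins (not the NumSharp functions) and assume a non-empty view described by a base offset, a shape, and element strides.

```python
def corner_offset(base, shape, strides):
    """Offset of the last logical element (index dim-1 on every axis)."""
    return base + sum((dim - 1) * stride for dim, stride in zip(shape, strides))


def largest_reachable_offset(base, shape, strides):
    """Largest offset any element can address: only positive strides move upward."""
    return base + sum((dim - 1) * stride
                      for dim, stride in zip(shape, strides) if stride > 0)


# A C-contiguous 3x4 array occupies offsets 0..11.
print(corner_offset(0, (3, 4), (4, 1)), largest_reachable_offset(0, (3, 4), (4, 1)))
# a[::-1]: the view starts at the last row (base 8) and walks rows backwards.
print(corner_offset(8, (3, 4), (-4, 1)), largest_reachable_offset(8, (3, 4), (-4, 1)))
# a[:, ::-1]: the view starts at the last column (base 3) and walks columns backwards.
print(corner_offset(3, (3, 4), (4, -1)), largest_reachable_offset(3, (3, 4), (4, -1)))
```

For `a[::-1]` the corner offset is 3 while the view legitimately reaches offset 11, so a bound check against the corner rejects most of the first two logical rows.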
…cy (Phase E)
NumPy force-casts any size-0 index array to intp and treats it as an empty integer fancy
index, never a boolean mask (mapping.c:425):
  A[np.array([], bool)] -> (0,4), not a 'boolean index did not match' length error.
NumSharp routed an empty bool array through the boolean-mask path, which enforced
length==axis and threw. Fixed in all three classification sites:
- PrepareIndex (multi-index tuples): a size-0, ndim>=1 array becomes an empty FancyArr
  (bool cast to int64), consuming one axis with a 0-size block dim.
- The single-index getter and setter NDArray paths (which bypass PrepareIndex): an empty
  array routes to the empty-fancy gather/scatter instead of BooleanMask.
Empty INTEGER fancy was already correct (A[empty_int] -> (0,4), B[:,empty_int] -> (2,0,4));
this extends the same to empty bool.
Random sweep [0,5000): divergences 74 -> 59, threw-on-valid 50 -> 35. Curated/Dtype gate +
1216 net8.0 indexing tests green.
… ValueError (Phase E)
UnmanagedStorage.SetData(NDArray,int[]) treated ANY size-0 value as an unconditional no-op
(added for np.pad's empty-axis assignment). But NumPy only no-ops when the TARGET region is
also empty; assigning an empty array into a NON-empty region cannot broadcast and raises
ValueError:
  A[()] = np.array([])   NumPy ValueError   was NumSharp silent no-op
  A[:]  = np.array([])   NumPy ValueError   was NumSharp silent no-op
Now guarded by the target subShape size: empty-into-empty still no-ops (np.pad preserved),
empty-into-non-empty raises the NumPy 'could not broadcast input array from shape ... into
shape ...' ValueError.
Random setter region [7500,10000): divergences 185 -> 122, ns-accepted-invalid 89 -> 37 (and
zero new threw-on-valid; the guard fires only where NumPy raises). Curated/Dtype gate + 1299
net8.0 indexing/selection tests green.
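The rule "validate the broadcast first, then decide whether there is anything to copy" can be sketched as follows. `validate_assignment` is a hypothetical helper; it uses a simplified broadcast test (it ignores NumPy's tolerance for extra leading size-1 value dims), and the message spacing is Python's tuple formatting, not NumPy's exact text.

```python
from math import prod


def validate_assignment(value_shape, target_shape):
    """Raise ValueError unless value_shape broadcasts to target_shape.

    Returns True when there are elements to copy, False for an empty target
    (a valid no-op). Simplified sketch, not NumPy's full assignment rule.
    """
    compatible = len(value_shape) <= len(target_shape) and all(
        v == t or v == 1
        for v, t in zip(reversed(value_shape), reversed(target_shape)))
    if not compatible:
        raise ValueError(
            f"could not broadcast input array from shape {value_shape} "
            f"into shape {target_shape}")
    return prod(target_shape) > 0


print(validate_assignment((0,), (0,)))   # False: empty into empty, nothing to copy
print(validate_assignment((), (3, 4)))   # True: a scalar always broadcasts
try:
    validate_assignment((0,), (3, 4))    # empty value into a non-empty region
except ValueError as err:
    print(err)
```

The ordering is the point: the emptiness of the value alone never short-circuits; only an empty target, reached after the broadcast check passes, turns the assignment into a no-op.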
…_BOOL + fancy) (Phase E)
A 0-d boolean index (np.array(True)/np.array(False), NumPy HAS_0D_BOOL) mixed with a fancy
array used to crash the grid (np.nonzero on a 0-d bool is unsupported) or fall to the
broadcast path. NumPy treats it as a length-1 (True) / length-0 (False) array that joins the
advanced BLOCK broadcast but consumes NO source axis and adds no output dim of its own.
TryBuildMultiAdvancedGrid now models it (new MixKind.ZeroBool):
- classified before the k-d-mask case; contributes its (1,)/(0,) array to the block
  broadcast, consumes no source axis (axisOfItem = -1), counts toward block consecutiveness;
- the grid fires when a slice/newaxis OR a 0-d bool is present (a 0-d bool can't use the
  broadcast path), with >=1 real fancy axis;
- advBOf[] maps each block member to its broadcast slot so only real fancy axes get a
  per-axis index array, while the 0-d bool only shapes the block.
Probed vs NumPy 2.4.2 (all exact):
  A[arr(2,1), True]         -> (2,1,4)
  A[True, arr([2,2])]       -> (2,4)
  V6[True, arr([1,2])]      -> (2,)
  A[True, arr([0,1]),True]  -> (2,4)
  A[arr([0,1]), True, 1]    -> (2,)
  A[arr([0,1]), False]      -> IndexError (block (2,) vs (0,))
(The False-mismatch IndexError is already raised up front by PrepareIndex's
broadcast-together check.)
Random GET sweep: 0-d-bool divergences 56 -> 13. Curated/Dtype gate + 1299 net8.0
indexing/selection tests green.
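The shape arithmetic behind the probes can be reproduced with a short sketch. Assumptions, stated loudly: `A` is taken to have shape (3,4) and `V6` shape (6,) (inferred from the results quoted in these messages), the helper names are hypothetical, and only tuples whose advanced items are all leading and consecutive are modelled (no slices, no transposition).

```python
def broadcast_block(*shapes):
    """Broadcast index shapes together, right-aligned; a size-0 dim stretches a size-1 dim."""
    ndim = max((len(s) for s in shapes), default=0)
    out = [1] * ndim
    for shape in shapes:
        for i, dim in enumerate(reversed(shape)):
            pos = ndim - 1 - i
            if dim == 1:
                continue
            if out[pos] == 1:
                out[pos] = dim
            elif out[pos] != dim:
                raise IndexError("shape mismatch: indexing arrays could not be "
                                 "broadcast together")
    return tuple(out)


def advanced_result_shape(source_shape, items):
    """items: ('fancy', shape) | ('b0', bool) | ('int',), all leading and consecutive."""
    block_members, consumed = [], 0
    for item in items:
        if item[0] == 'fancy':
            block_members.append(tuple(item[1]))
            consumed += 1
        elif item[0] == 'int':            # an int next to fancy joins the block as 0-d
            block_members.append(())
            consumed += 1
        else:                             # 0-d bool: (1,) or (0,), consumes no axis
            block_members.append((1,) if item[1] else (0,))
    return broadcast_block(*block_members) + tuple(source_shape[consumed:])


A, V6 = (3, 4), (6,)   # assumed base shapes
print(advanced_result_shape(A, [('fancy', (2, 1)), ('b0', True)]))
print(advanced_result_shape(V6, [('b0', True), ('fancy', (2,))]))
print(advanced_result_shape(A, [('fancy', (2,)), ('b0', True), ('int',)]))
```

A `False` 0-d bool contributes a (0,) member, so pairing it with a length-2 fancy array fails the block broadcast, matching the IndexError row in the probe table.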
…l through to fancy) (Phase E)
The setter's pure-basic branch — new NDArray(Storage.GetView(slices)).SetData(values, []) —
had no `return` after it (the getter's equivalent returns its view). So after correctly
assigning through the slice view, control FELL THROUGH into the _NDArrayFound advanced-index
label, which re-interpreted the same slices as fancy index arrays (GetIndicesFromSlice per
axis) and tried to broadcast/scatter them. That re-interpretation threw on every pure-slice /
ellipsis / newaxis assignment:
A[...] = scalar -> ArgumentException 'Value cannot be an empty collection (indices)'
(ellipsis builds an EMPTY advanced index list)
A[:, :] = scalar -> IncorrectShapeException 'objects cannot be broadcast to a single shape'
(slice index arrays (3,) and (4,) can't broadcast together)
A[None] = arr([9]) -> ArgumentException (newaxis -> empty advanced list)
Adding the missing `return` makes the slice view assignment terminal, matching the getter.
Random setter region [7000,10000): divergences 120 -> 62, threw-on-valid 68 -> 10 (the entire
ArgumentException(34)+IncorrectShapeException(14) buckets cleared). Curated/Dtype gate + 1299
net8.0 indexing/selection tests green.
…g index (Phase E)
ExpandEllipsisForMixed (used by the 0-d-bool and leading-mask basic handlers) counted a 0-d
boolean toward the axis-consuming items, so the ellipsis under-filled by one and the
inserted size-1/0 axis landed at the wrong output position:
  AT[..., True]    on (4,3)  NumPy (4,3,1)  was NumSharp (4,1,3)
  ABC[..., False]  on (3,4)  NumPy (3,4,0)  was NumSharp (3,0,4)
  A[..., False,0]  on (3,4)  NumPy (3,0)    was NumSharp (0,4)
A 0-d bool (HAS_0D_BOOL) consumes NO source axis, so it is now skipped in the ellipsis fill
count alongside newaxis. Curated/Dtype gate + 1216 net8.0 indexing tests green.
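The under-fill is easy to reproduce in a toy model. The sketch below uses hypothetical names and handles only Ellipsis, None, Python bools (standing in for 0-d boolean arrays) and ints, emitting dims in place; the `count_bool` flag re-creates the old counting so both columns of the table above can be checked.

```python
def ellipsis_fill(ndim, index, count_bool=False):
    """Number of source axes the ellipsis must stand for."""
    consumed = 0
    for item in index:
        if item is Ellipsis or item is None:
            continue                      # neither consumes a source axis
        if isinstance(item, bool) and not count_bool:
            continue                      # a 0-d bool consumes no source axis
        consumed += 1
    return ndim - consumed


def result_shape(shape, index, count_bool=False):
    """In-place output shape for Ellipsis / None / 0-d bool / int items."""
    fill = ellipsis_fill(len(shape), index, count_bool)
    out, axis = [], 0
    for item in index:
        if item is Ellipsis:
            out.extend(shape[axis:axis + fill])
            axis += fill
        elif item is None:
            out.append(1)
        elif isinstance(item, bool):
            out.append(1 if item else 0)  # True -> length 1, False -> length 0
        else:
            axis += 1                     # an int consumes an axis, emits no dim
    out.extend(shape[axis:])
    return tuple(out)


print(result_shape((4, 3), (Ellipsis, True)))                    # (4, 3, 1)
print(result_shape((4, 3), (Ellipsis, True), count_bool=True))   # (4, 1, 3), the old result
```

This in-place model is only valid while the advanced items are consecutive; the separated case needs the move-to-front rule described for R2.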
…sts to selection (Phase E)
The non-subshaped fancy setter branch (a[fancy] = value where the fancy indices cover every
axis) wrote the value flat into the selected slots without checking it broadcasts to the
indexing-result shape. A value of an incompatible shape was silently partial-written (or
caught late by the memory-safety guard) instead of raising NumPy ValueError:
  V6[[2]] = [1,2,3,4,5]   NumPy ValueError (value (5,) into selection (1,))   was silent/IOoR
Now it materializes the value to a C-contiguous buffer of exactly retShape via
np.broadcast_to (matching the subshaped branch), raising the NumPy 'could not broadcast
input array from shape ... into shape ...' ValueError on mismatch; scalar and
exactly-matching values keep their fast paths.
Random setter region [7500,10000): divergences 52 -> 38, ns-accepted-invalid 39 -> 21 (the
residual are empty-selection assignments, which short-circuit before this point).
Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.
…memory safety) (Phase B/E)
UnmanagedStorage.SetData(NDArray, int[]) did NOT wrap negative coordinates, unlike the getter's
GetData(int[]) which calls Shape.InferNegativeCoordinates. So a negative single-index assignment
reached via the object[] setter or the long[] coordinate shim wrote at buffer[-1]:
b[(object)-1] = v wrote ONE ELEMENT BEFORE the buffer (OOB heap write), leaving the array
unchanged (NumPy assigns the LAST element)
This was the mixed-advanced sweep's flaky AccessViolation: a fresh np.arange(6) copy whose
buffer[-1] write corrupted adjacent native-pool/GC memory, fatal only after enough accumulation
(found by amplifying each divergent case in a tight loop until set/V6/rand/7047 = V6[-1]=scalar
crashed). SetData now applies the same InferNegativeCoordinates wrap+bounds-check as GetData:
the last element is assigned, and a genuinely out-of-range index raises NumPy's IndexError.
Random sweep: divergences 93 -> 65; the divergent-case mini-corpus that crashed at this case now
survives 4000x. Curated/Dtype gate + 1299 net8.0 indexing/selection tests green.
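The wrap-plus-bounds-check that the setter was missing looks roughly like this. It is a sketch with hypothetical names, not the Shape.InferNegativeCoordinates code. One caveat is labelled in the comments: a Python list silently wraps index -1 on its own, whereas a native buffer does not, so the explicit wrap is what stands between a negative coordinate and an out-of-bounds write.

```python
def wrap_coordinates(shape, coords):
    """Map negative coordinates to their positive equivalents, rejecting out-of-range ones."""
    wrapped = []
    for axis, (size, coord) in enumerate(zip(shape, coords)):
        if not -size <= coord < size:
            raise IndexError(
                f"index {coord} is out of bounds for axis {axis} with size {size}")
        wrapped.append(coord + size if coord < 0 else coord)
    return wrapped


def set_element(buffer, shape, strides, coords, value):
    """Write one element; the offset is computed only from wrapped, validated coordinates."""
    offset = sum(c * s for c, s in zip(wrap_coordinates(shape, coords), strides))
    # In native code an unwrapped -1 here would address the element BEFORE the buffer;
    # a Python list would hide that bug, so the sketch never passes a negative offset.
    buffer[offset] = value


data = list(range(6))
set_element(data, (6,), (1,), (-1,), 99)
print(data)   # [0, 1, 2, 3, 4, 99]
```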
…nces, Phases A-E)
Updates the combinatorial advanced-indexing handover with an Execution status section:
Phases A-C done, D done, E mostly done; differential random sweep 697 -> 64 divergences
(91%), curated/dtype gate 0/0, full CI 10980/0 on net8.0 + net10.0. Lists the landed commits
per phase and the precise remaining work (setter value-broadcast on empty selections,
multi-0-d-bool placement in TryBuild0dBoolWithBasic, multi-dim-fancy+ellipsis+int, a second
layout-dependent teardown OOB, and the final Try* cleanup + Index_Random un-mark).
…-advanced (Phase D)
TryBuildMultiAdvancedGrid bailed when no slice/newaxis/0-d-bool was present
(hasExplicitBasic), sending pure-advanced tuples (fancy+int, fancy+fancy with no slice) to
the old _NDArrayFound broadcast path. That path mis-placed a MULTI-DIM fancy combined with
an int:
  AT[..., arr(4,1), -1]   NumPy (4,1)   was NumSharp (4,3)
The grid is NumPy's general advanced-index algorithm (block broadcast + consec-aware
placement), so it now fires for any tuple carrying an advanced block member (a fancy array,
or a 0-d bool); only pure-basic tuples (slices/ints/newaxis, no block) fall through to the
view path.
Random sweep 64 -> 61; full indexing/selection suite 1299 net8.0 green.
…tinuation plan
Amends the Execution status with a detailed, ordered Remaining work section (R1-R4) plus a
Diagnostic tooling subsection, so the open items can be picked up directly:
- R1 Setter value-broadcast on EMPTY selections (~25, largest): root cause is the
  SetIndices<T> empty short-circuits returning before retShape + value validation; fix
  restructures so retShape is computed first, value broadcast-validated (ValueError), then
  empty -> no-op; incl. the 0-d-bool-False branch. File:method:line, care/gate, expected
  delta.
- R2 Multi / non-consecutive 0-d-bool placement: TryBuild0dBoolWithBasic lacks the consec
  rule; route to the grid (which has ZeroBool + consec) or delete it; unit-test the
  permutation.
- R3 Second layout-dependent teardown OOB: writes past a PINNED MANAGED array (page-heap
  can't catch); re-check after R1/R2, else red-zone FromArray or loop_mini cross-case
  repetition.
- R4 Final cleanup: delete the dead TryFetchSliceWithSingleAdvanced + getter/setter
  _NDArrayFound (the grid owns all HAS_FANCY now); un-mark Index_Random [OpenBugs] at 0/0.
Also documents the scratchpad diagnostic harnesses (replay_index_jsonl / gchunt / loop_each /
loop_mini, page-heap, mini-corpus build, the runfile-cache gotcha) and updates the totals to
697 -> 61 / commit range through 9c2e16b2.
…gnment (R1)
NumPy requires an assigned value to broadcast to the indexing-RESULT shape even when that
selection is empty (contains a 0) or is a single element: a value that cannot broadcast
raises ValueError; it is NOT silently a no-op. NumSharp short-circuited three setter paths
before validating, accepting what NumPy rejects.
Random differential sweep (seed 20240626): the set-side "ns-accepted-invalid" bucket drops
25 -> 0 (61 -> ~36 total divergences). Curated/dtype gate stays 0/0; full CI 10980 pass /
0 fail on net8.0 AND net10.0.
Three sub-fixes, all the same NumPy rule (value broadcasts to selection-or-raise; a scalar
always broadcasts):
1. 0-d-bool-False mixed with basic (Selection.Setter.cs). The any-False branch of
   TryBuild0dBoolWithBasic returned with no validation. Now it computes the empty selection
   shape (this[boolBasic].shape with [boolAxis]=0) and broadcasts the value there, raising
   the NumPy "shape mismatch: value array of shape X could not be broadcast to indexing
   result of shape Y" (no-space tuple) on mismatch. Covers e.g. set
   ANR[slice, b0(False)] = arr([4,75]) and the multi-0-d-bool forms [b0F,b0T,int] /
   [b0T,b0F,int] (one False -> length-0 block).
2. Scalar-element target (UnmanagedStorage.Setters.cs, scalar-to-scalar branch). When the
   coordinate consumes EVERY axis the target is a single element (shape ()), where NumPy
   requires a 0-d / scalar value: a 1+-D array, even size 1 (e.g. a[3] = np.array([78]) or
   m[0,2] = np.array([94])), raises "setting an array element with a sequence."; it is NOT
   unwrapped to its first element. The looser valueIsScalary (which also accepts a (1,)
   array) is correct only for the sub-array broadcast branch, not a single-element target;
   use valueshape.IsScalar.
3. Boolean-mask zero-select (Default.BooleanMask.cs, BooleanMaskSet). The trueCount==0 early
   return skipped value validation; arr[allFalseMask] = [93,1,39] into a (0,4) selection now
   raises the shape-mismatch ValueError. A scalar splat still no-ops.
Pre-existing, unrelated failures confirmed against a clean baseline (in CI-excluded
categories): HashHelpersLong_GetPrime/_ExpandPrime, Slice2x2Mul_AssignmentChangesOriginal
(np.arange int64 vs ToArray<int>), Broadcast_Sum_InternalError.
…ices
Indexing a 0-dimensional array (e.g. np.array(5)) with ANY axis-consuming index --
an integer/boolean ARRAY, including an empty one (s[np.array([],int)]), or a raw
int[]/long[] fancy index -- is "too many indices" in NumPy: a scalar has no axes to
consume. NumSharp's SINGLE-index dispatch bypasses the PrepareIndex gate (which is
keyed on indicesLen != 1), so these fell straight into the fancy gather and returned
a bogus shape instead of raising.
NumPy (mapping.c prepare_index):
np.array(5)[np.array([],int)] -> IndexError "too many indices for array:
array is 0-dimensional, but 1 were indexed"
np.array(5)[np.array([-1,-1])] -> same
np.array(5)[np.array([F])] -> same (a bool array consumes its ndim axes)
np.array(5)[np.array(True)] -> OK (1,) (a 0-d bool consumes no axis)
np.array(5)[np.newaxis] / [...] -> OK (no axis consumed)
Fix: in the getter and setter single-index (indicesLen == 1) NDArray / int[] / long[]
branches, when this.ndim == 0 raise the NumPy IndexError unless the index is a 0-d
boolean (which adds a length-1/0 axis and is handled by the mask path below). The
"N were indexed" count is the bool array's ndim (axes it expands to) else 1.
Random differential sweep (seed 20240626): the get-side "ns-accepted-invalid"
bucket (all base-S, the 0-d scalar) drops 12 -> 0. Curated/dtype gate stays 0/0;
full CI 10980 pass / 0 fail on net8.0 AND net10.0.
An advanced-index combo whose broadcast result is EMPTY (some advanced index has size 0)
must yield an empty-shaped result; NumPy gathers nothing and never validates the un-accessed
index values. NumSharp instead bounds-checked the sibling fancy values and rejected
zero-length boolean masks, throwing on valid empty selections.
Random differential sweep (seed 20240626): the get-side "ns-threw-on-valid" bucket drops
~13 -> 0 and 5 setter cases that share these paths also clear (set region [7000,10000)
8 -> 3). Curated/dtype gate stays 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.
Two NumPy-verified (2.4.2) rules:
1. Skip fancy-ARRAY value bounds when the advanced block is empty (PrepareIndex
   FinishPrepare). NumPy bounds-checks integer array values at GATHER time, so when the
   block broadcasts to a 0 nothing is gathered and out-of-range values are never seen:
     A[arr([-3,1,2,3],(4,1)), np.array([],int)] -> (4,0)
     A[arr([99]), np.array(False)]              -> (0,4)
   are valid (NOT IndexError). A SCALAR int index is still validated eagerly
   (A[np.array([],int), 9] still raises on the 9), as is a bool array's axis size;
   broadcastability is still checked, so an un-broadcastable empty combo ((2,2) with (0,))
   still raises shape-mismatch. The block is empty iff a Fancy / 0-d-bool op is itself
   size 0. (Used `is not null` for op.Array: the NDArray == / != operators are element-wise,
   not reference.)
2. A zero-length boolean mask axis matches an array axis of ANY size (IsPartialShapeMatch).
   NumPy: A[np.zeros(0,bool)] -> (0,4) on a size-3 axis, A[np.zeros((3,0),bool)] -> (0,);
   only a NON-zero mask axis must equal the array axis (A[np.zeros((0,2),bool)] still raises
   on the size-2 axis). This fixes the leading-empty-mask-plus-basic combos
   (ACS[barr([],(0,)), 0] -> (0,), B[barr([],(0,)), -1] -> (0,4)) that routed through
   this[mask] and were rejected. BooleanMask already returns (0,)+trailing for a zero-true
   mask.
Remaining (tracked): 3 non-consecutive 0-d-bool placement shapeDiffs (R2) and 3 setter
throw-on-valid empty cases; plus the flaky teardown OOB (R3).
…-op, not a crash
A pure-basic indexing assignment (slices / newaxis / ellipsis / scalar int) whose selection
is EMPTY assigns nothing in NumPy, but the value must still broadcast to the empty target
shape (a scalar/size-1 always does; an incompatible value raises ValueError). NumSharp routed
the empty sliced view through NDIter.Copy, whose CreateCopyState indexes the first element of
each operand and threw IndexOutOfRangeException on the 0-size view.
NumPy (2.4.2):
  a[None, :0:3, 2] = np.array([15])        -> no-op, selection (1,0)
  a[-1:-4, -2:2:2, ...] = np.array([])     -> no-op, selection (0,0)
  a[:, ::2, ::-1, None] = np.array([42])   -> no-op, selection (1,0,2,1)
  a[:, 1:1] = np.array([1,2,3])            -> ValueError (value can't broadcast to (3,0))
Fix: in UnmanagedStorage.SetData(NDArray, int[]), the broadcasted/sliced branch now fetches
the target view once and, when it is size 0, validates the value broadcasts to the target
shape (NumPy "could not broadcast input array from shape X into shape Y" on mismatch, the
basic-indexing message form) then returns without invoking the copy iterator. A
scalar/size-1 value skips the check (always broadcasts).
Random differential sweep (seed 20240626): set region [7000,10000) reaches 0 divergences
(3 -> 0); the whole sweep is now down to 3 (the R2 0-d-bool placement shapeDiffs).
Curated/dtype gate 0/0; full CI 10980 pass / 0 fail on net8.0 AND net10.0.
… hash collision (R2)
Two fixes; together they take the random differential sweep (seed 20240626) to 0
divergences across every measurable window (8700/10000; the [6700,7000) gap is the
R3 teardown-OOB crash zone, tracked separately). Curated/dtype gate 0/0; full CI
10980 pass / 0 fail on net8.0 AND net10.0.
1. Non-consecutive 0-d-bool axis placement (NDArray.Indexing.Selection.Getter.cs).
When 0-d bools / ints in an advanced block are SEPARATED by a slice or newaxis,
NumPy moves the merged advanced axis to the FRONT (_get_transpose); the in-place
handler TryBuild0dBoolWithBasic can't express that. It now BAILS for a
non-consecutive advanced block (the items span a slice/newaxis), routing to the
grid which has the consec/front rule. e.g.
a[int, slice, True] -> NumPy (1,0) was (0,1)
a[slice, True, slice, False]-> NumPy (0,2,1) was (2,0,1)
a[int, None, slice, False] -> NumPy (0,1,0) was (1,0,0)
The grid also had to carry the block dims when the block is PURELY 0-d bools (no
fancy array): nothing fed bshape into broadcast_arrays, so an any-False block kept
the all-ones default length 1 instead of 0. The grid now stretches its index
arrays to the already-correct outShape.
2. Shape.Broadcast short-circuited on a hash COLLISION (View/Shape.Broadcasting.cs).
The shape hash collides a 0-length axis with a size-1 axis — (1,1) and (0,1) both
hash to 26599 — and Broadcast returned the left shape unchanged whenever the
hashes matched, so broadcast_to((1,1),(0,1)) wrongly yielded (1,1) instead of
stretching the size-1 axis to the length-0 axis (NumPy: (0,1)). This surfaced in
the pure-0-d-bool grid above and is a general broadcasting bug. The identical-shape
fast path now confirms the dimensions are actually equal before short-circuiting;
on a collision it falls through to the real broadcast (a tiny O(ndim) loop only
when the hashes already match).
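The failure mode generalises to any hash-keyed fast path, so a toy version may be useful. Everything here is an assumption for illustration: `toy_shape_hash` is NOT NumSharp's shape hash (it is built to collide a 0-length axis with a size-1 axis on purpose), and `broadcast_pair` is a hypothetical two-shape broadcast with the `verify_dims` flag standing in for the added dimension check.

```python
def toy_shape_hash(dims):
    """Deliberately weak hash: max(d, 1) maps a 0-length axis onto a size-1 axis."""
    h = 17
    for d in dims:
        h = h * 31 + max(d, 1)
    return h


def broadcast_pair(left, right, verify_dims=True):
    """Broadcast two shapes; the identical-shape fast path is keyed on the hash."""
    if toy_shape_hash(left) == toy_shape_hash(right):
        # A matching hash is only a hint; confirm the dims really are equal.
        if not verify_dims or tuple(left) == tuple(right):
            return tuple(left)
    ndim = max(len(left), len(right))
    lhs = (1,) * (ndim - len(left)) + tuple(left)
    rhs = (1,) * (ndim - len(right)) + tuple(right)
    out = []
    for l, r in zip(lhs, rhs):
        if l == r or r == 1:
            out.append(l)
        elif l == 1:
            out.append(r)             # stretches size-1 to r, including r == 0
        else:
            raise ValueError(f"shapes {left} and {right} cannot be broadcast together")
    return tuple(out)


print(broadcast_pair((1, 1), (0, 1), verify_dims=False))  # (1, 1): the collision bug
print(broadcast_pair((1, 1), (0, 1)))                     # (0, 1): falls through correctly
```

The equality check costs one pass over the dims and only runs when the hashes already agree, so the fast path stays cheap for the common identical-shape case.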
…ty bool mask
Adds Indexing.CombinatorialParity.MatrixTests.cs, a [FuzzMatrix] CI gate that pins the five
mapping.c-parity buckets fixed this session (R1 value-broadcast on empty/scalar selections,
B2 0-d-base over-index, B3 empty advanced gather, B4 empty-slice assignment no-op, R2
non-consecutive 0-d-bool placement). Every shape/value/raise was probed against NumPy 2.4.2.
These cases come from the seeded random sweep (index_random_20240626.jsonl), which stays
[OpenBugs] only behind the flaky teardown OOB (handover R3), so the now-passing forms are
gated here independently.
Writing the tests surfaced one more divergence (NOT in the random corpus, which only had 1-D
empty masks): a MULTI-DIMENSIONAL empty boolean mask was routed as a single empty integer
fancy index, giving the wrong rank. NumPy treats an empty bool array as a MASK that consumes
mask.ndim axes via its nonzero:
  A[np.zeros((3,0), bool)] -> NumPy (0,)   NumSharp was (3,0,4)
  A[np.zeros(0, bool)]     -> (0,4)        (the 1-D case matched only by coincidence)
Fix: in the getter and setter single-index dispatch, exclude boolean arrays from the size-0
empty-fancy routing so every empty bool mask falls to the mask path (which, since the
IsPartialShapeMatch zero-axis fix, returns (0,)+trailing correctly).
Full suite 11003 pass / 0 fail on net8.0 AND net10.0 (10980 + 23 new); random sweep stays 0
divergences across all measurable windows.
…e open item
Updates the combinatorial-indexing handover to the executed reality: all five divergence
buckets (R1 value-broadcast on empty/scalar selections, B2 0-d-base over-index, B3 empty
advanced gather, B4 empty-slice assignment, R2 non-consecutive 0-d-bool placement + the
Shape.Broadcast hash collision it uncovered) are fixed, committed (aea9fc78..7e968f5e), and
pinned by the new Indexing.CombinatorialParity [FuzzMatrix] gate. The random sweep is 0
divergences across every measurable window.
The only open item is R3, a pre-existing flaky teardown heap-corruption. Records the
diagnostics gathered this session that supersede the prior "pinned managed" guess: it is a
specific corpus shape (not allocation volume), it is NOT a pooled native-buffer overrun
(end-aligned guard pages run clean), and it resists deterministic repro (crash point varies
6285-9879; ~1/3 even in the tightest window). Next-session plan: a per-case red-zone on
FromArray / the direct-VirtualAlloc zeroed path, or a dotnet-dump capture. Index_Random stays
[OpenBugs] purely on the crash, not parity.
Mechanical, behavior-preserving rename of NumSharp's iterator/expression
stack from the `Npy*` prefix to `ND*`. ND* matches NumSharp's house style
(`NDArray` ↔ numpy.ndarray, the retired NDIterator) and NumPy's Python
`nditer` name. The old `Npy*` prefix mirrored NumPy's C struct name
(`NpyIter`); `ND*` is the user-facing convention used everywhere else.
Scope (NumSharp-owned only):
- Types/delegates/enums: NpyIter→NDIter, NpyIterRef→NDIterRef,
NpyExpr→NDExpr, NpyIterState/Flags/PerOpFlags/GlobalFlags/OpFlags,
NpyFlatIterator→NDFlatIterator, NpyAxisIter→NDAxisIter,
NpyMemOverlap→NDMemOverlap, reduction-kernel structs + interfaces
(INpy…→IND…), NpyArrayMethodFlags→NDArrayMethodFlags, and the
NpyIter_* C-API-style method names → NDIter_*.
- Utilities: NpyComplexMath→NDComplexMath, NpyDivision→NDDivision,
NpyIntegerPower→NDIntegerPower.
- Benchmark subsystem: benchmark/npyiter → benchmark/nditer
(npyiter_{bench,sheet,cards,results,headline} → nditer_*,
--skip-npyiter → --skip-nditer).
- 65 files renamed via git mv; ~190 files content-swept; website docs,
docs/numpy notes, and frozen benchmark/history snapshots included.
Preserved (genuine NumPy references, NOT the stack):
- src/numpy/** (the upstream clone — NpyIter is NumPy's real C type).
- The .npy/.npz file format: `#region NpyFormat` (np.save/np.load) and the
SaveAndLoadWithNpyFileExt test.
- NumPy's C function names quoted in docs (npyiter_allocate_arrays,
npyiter_coalesce_axes, … kept verbatim).
Build: solution green (0 errors). Tests: 10980 passed, 0 failed, 11 skipped
(net10.0, CI filter TestCategory!=OpenBugs&!=HighMemory).
Branch commit messages were rewritten Npy→ND separately (message-only
history rewrite; file blobs in historical commits untouched). This commit
is registered in .git-blame-ignore-revs as a mechanical rename.
…e nditer branch
Replaces the stale PR description (written ~64 commits in, +50k lines) with a complete
changelog of everything between the #612 merge-base (5eedb81) and HEAD: 272 commits, 519
files, +198,407/-16,069 per the GitHub compare.
Compiled via a two-pass audit:
- Pass 1: every commit subject+body mined for features, perf numbers, and breaking changes;
  APIs/CI/benchmark/corpus facts verified against the live tree (test counts, fuzz corpus wc,
  Direct partial count, NDIter LOC).
- Pass 2: all 279 local commits re-walked against the draft. Caught and fixed:
  np.median/percentile/quantile/average/ptp/tile did NOT exist on master (verified via git
  grep origin/master), so they were reclassified from 'rebuilt' to new, raising the new-API
  count 22 -> 30; removed an unverifiable test count; added the 15-dtype hot-path parity item
  (786d705) and the DefaultEngine->NDIter Tier-3B production routing.
Scope note: SByte/Half/Complex + DateTime64 + casting rounds are PR #612 (already on master)
and are intentionally excluded; the local master ref is stale, which is why master..HEAD
misleadingly shows 339 commits.
The same content (minus the H1) is now the live PR #611 description, pushed via REST PATCH
(gh pr edit requires read:org scope the token lacks).
Complete changelog of the `nditer` branch: everything in this PR since #612 merged.

455 commits · 806 files · +234,348 / −19,179 (vs `master`, after #612)

TL;DR
- `NpyIter`: full port of NumPy 2.4.2's `nditer` (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (`COPY_IF_OVERLAP`), windowed buffering (`DELAY_BUFALLOC`), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
- `NpyExpr` DSL + three-tier custom-op API: write your own ufuncs as raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public `np.evaluate`, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy `result_type` typing and fused reductions.
- `out=` / `where=` / `dtype=` ufunc kwargs across the elementwise API: the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus `np.bitwise_and/or/xor` and `np.positive` at the `np.*` surface.
- `np.*` APIs: `sort`, `pad` (11 modes), `tile`, `median`/`percentile`/`quantile` (all 13 interpolation methods) + their `nan*` variants, `average`, `ptp`, `take`/`put`/`place`, `extract`/`compress`, `diagonal`/`trace`, `argwhere`/`flatnonzero`, `unravel_index`/`ravel_multi_index`/`indices`, `delete`/`insert`/`append`, `diff`/`ediff1d`, `asfortranarray`/`ascontiguousarray`, `np.multithreading`.
- `Shape` understands F-contiguity, `OrderResolver` resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
- `np.sort`/`np.argsort` on a radix line-kernel (closes a Missing Function); a SIMD strided-cast campaign that killed the cast cliffs (15×8×15 `astype` matrix: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy); `np.zeros` via `calloc`/demand-zero (O(1), was ~1000× slower); the six Complex transcendentals (`sinh`…`arctan`); and bit-exact pairwise summation for `sum`/`mean`.
- `IDisposable` on `NDArray`, plus a tcache-style buffer pool (1 B – 64 MiB window).
- `MultiIterator`, the Regen-generated cast templates, and `NDIterator` itself (interface + class + `AsIterator` extensions) are all gone; every code path now iterates
through `NpyIter` / `NpyFlatIterator` / `GetAtIndex`.

1. NpyIter: full NumPy `nditer` port

From-scratch C# port of NumPy 2.4.2's iterator machinery under
`src/NumSharp.Core/Backends/Iterators/` (~12,557 lines), promoted to public API with NDArray overloads.

- `MULTI_INDEX`, `C_INDEX`, `F_INDEX`, `RANGE` (parallel chunking), `GotoIndex` / `GotoMultiIndex` / `GotoIterIndex`
- `DELAY_BUFALLOC`, buffered-reduce double-loop (incl. `bufferSize < coreSize`)
- `op_axes` with `-1` reduction axes, `REDUCE_OK`, `IsFirstVisit`, `REUSE_REDUCE_LOOPS` slab accumulation
- `COPY_IF_OVERLAP` via a port of NumPy's `mem_overlap` solver (`NpyMemOverlap.cs`): overlapping in/out operands no longer silently corrupt
- `WRITEMASKED` + `ARRAYMASK` executed: the buffered window flush writes back only mask-nonzero elements; `VIRTUAL` operands (null op slots) construct with NumPy 2.x semantics
- Unlimited operands (NumPy caps at `NPY_MAXARGS=64`) and unlimited dimensions (NumPy caps at `NPY_MAXDIMS=64`) via dynamic allocation
- `Copy`, `GetIterView`, `RemoveAxis`, `RemoveMultiIndex`, `ResetBasePointers`, `IterRange`, `DebugPrint`, fixed/axis stride queries, `GetValue<T>` / `SetValue<T>`, …
- `NpyIterCasting.CanCast` matches NumPy's `safe` / `same_kind` lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and
`tools/iterator_parity`. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, `NO_BROADCAST` enforcement, `F_INDEX` coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP `iternext`, buffered-cast `Advance`, ranged `Reset()` desync, buffer free-list corruption, the size-1 stride-0 invariant (a `(1,4)` view with nonzero stride corrupted `RemoveMultiIndex`), `op_axes` out-of-bounds reads on stretched size-1 axes, write-broadcast validation, `PARALLEL_SAFE` wiring, and unit-axis absorption. Each was reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed
`NpyIter` isn't just correct: it is now the production execution engine. `DefaultEngine`'s binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):

[Table: probed aspects vs NumPy; surviving row labels: `a*b+c` (10M), `(a-b)/(a+b)` (10M)]

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (`vgatherdps`) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce, so its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API
User-extensible kernel layer on top of `NpyIter`, the public answer to "how do I write my own ufunc":

- `ExecuteRawIL`: emit raw IL against the NumPy ufunc signature `void(void** dataptrs, long* strides, long count, void* aux)`.
- `ExecuteElementWise`: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
- `ExecuteExpression`: compose `NpyExpr` trees with C# operators (`(a - b) / (a + b)`), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, `Min`/`Max`/`Clamp`/`Where`), plus `Call()` to splice any delegate/`MethodInfo` into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

This is what powers the fusion wins (one pass, no temporaries), and it is exposed publicly as
`np.evaluate(expr[, operands][, out])`:

- `result_type` typing: every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: `(i4*i4)+f8` wraps the multiply in int32 (→ `1410065408`) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (`i4+2 → i4`, `i4/2 → f8`) with NumPy's exact `OverflowError`, and special resolvers (`true_divide`, `arctan2`, negative-integer-literal `power` → `ValueError`, bool `add`=OR / `multiply`=AND).
- `NpyExpr.Sum`/`Prod`/`Min`/`Max`/`Mean` compile a one-pass inner loop; `sum(a*b)` reads `a` and `b` once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
- `out=` joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through `COPY_IF_OVERLAP`); an `EXTERNAL_LOOP` guard prevents the silent `count==1` slow path.
- `a*b+c` 3.2×, `(a-b)/(a+b)` 6.1×, `sum(a*b)` 3.6×, `sum f32` 2.9×, `i4*2+f8` 3.5× faster. Permanent gate in `benchmark/fusion/evaluate_bench.{cs,py}`.

3. Legacy iterator stack retired
- `MultiIterator` deleted; all callers migrated to `NpyIter.Copy` / multi-operand execution.
- `NDIterator.template.cs` + 16 generated `NDIterator.Cast.*` partials deleted (−3,870 LOC in one commit).
- `NDIterator` (interface + `NDIterator<T>` + `AsIterator` extensions) deleted entirely: `[Obsolete]` tombstones that threw at runtime after the migration and were referenced by nothing live. Production iteration runs through `NpyIter`/`NpyIterRef` (kernels), `GetAtIndex` (flat reads), and `NpyFlatIterator` (`np.broadcast(...).iters`).
- ~400 per-dtype `NPTypeCode` switch sites replaced by a generic `NpFunc` dispatch utility.

4. C/F/A/K memory-layout support
- `Shape` now tracks F-contiguity with NumPy-convention contiguity computation; new `OrderResolver` resolves `C`/`F`/`A`/`K` for every API with an `order` parameter.
- `copy`, `array`, `asarray`, `asanyarray`, `*_like`, `astype`, `flatten`, `ravel`, `reshape`, `eye`, `concatenate`, `cumsum`, `argsort`, `tile`, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
- `np.asfortranarray`, `np.ascontiguousarray`.
- `np.where` selects C/F output layout the way NumPy does; `ravel('F')` of an F-contig source returns a view (was a 3,000× copy).
- `fortran_order`, Decimal scalar path, fancy-write isolation, …).

5. New & completed `np.*` APIs

New functions (36):
- `np.evaluate` (fused expressions, see §2), `np.bitwise_and`, `np.bitwise_or`, `np.bitwise_xor`, `np.positive`
- `np.sort` (+ `ndarray.sort`; `np.argsort` reimplemented): radix line-kernel on NpyIter, stable, NaN-last, all axes / orders (`IterAllButAxis` drive mirroring NumPy's `_new_sortlike`)
- `np.pad` (all 11 NumPy modes + callable), `np.tile`, `np.delete`, `np.insert`, `np.append`
- `np.take`, `np.put`, `np.place`, `np.extract`, `np.compress`, `np.argwhere`, `np.flatnonzero`, `np.diagonal`, `np.trace`, `np.unravel_index`, `np.ravel_multi_index`, `np.indices`
- `np.median`, `np.percentile`, `np.quantile` (all 13 interpolation methods, tuple axis, `out=`, `keepdims`, QuickSelect engine), `np.average` (`weights`, `returned`, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), `np.ptp`, `np.nanmedian`, `np.nanpercentile`, `np.nanquantile`
- `np.diff`, `np.ediff1d`
- `np.asfortranarray`, `np.ascontiguousarray`
- `np.multithreading(enabled, max_threads)`: opt-in threaded kernels

Rebuilt to full NumPy 2.x parity:
- `np.clip`: `min=`/`max=` keyword aliases, default-None bounds, NumPy 2.x dtype promotion, `out=` validation.
- `np.unique`: 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, `n > Array.MaxLength` fallback.
- `np.searchsorted`: `side=`, `sorter=`, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
- `np.copyto`: `casting=`, `where=` masked copies at NumPy speed (was 7–72× slower).
- `np.asarray`: `copy=`, `like=`, `device=`, dtype-as-string.
- `np.concatenate`: full parity + C/F fast paths.
- `np.all`/`np.any`: tuple-axis, `out=`, `where=`.
- `np.expand_dims`: tuple axis.
- `np.repeat`: `axis=` parameter.
- `np.power`: integer-power semantics, negative-exponent `ValueError`, crash fix.
- `np.broadcast`: N-operand form (0..64, then unlimited; NumPy parity, was 2-operand only), live index cursor, lazy `.iters`, `.numiter`.
- `max`/`min`, Complex quantile, `IsInf` implemented (was a stub); the six Complex transcendentals `sinh`/`cosh`/`tanh`/`arcsin`/`arccos`/`arctan` implemented (hybrid BCL + C99 edge fix-ups, NumPy 2.4.2 parity; were `NotSupportedException`).

`out=` / `where=` / `dtype=` ufunc kwargs (NumPy parity):

The kwargs present on every NumPy ufunc now span the elementwise core: binary (`add`, `subtract`, `multiply`, `divide`, `true_divide`, `mod`, `power`, `floor_divide`), unary-math (`sqrt`, `exp`, `log`, `sin`, `cos`, `tan`, `abs`/`absolute`, `negative`, `square`), the six comparisons, predicates (`isnan`/`isfinite`/`isinf`), bitwise, `invert`, `arctan2`. Each is one NumPy-shaped overload, with every rule pinned against NumPy 2.4.2:

- `out` joins the broadcast but never stretches (mismatched/stretchable `out` raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), `out` only needs a same_kind cast; the provided instance is returned (reference identity).
- `where` must be exactly `bool` (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior `out` contents.
- `out` aliasing an input is well-defined via `COPY_IF_OVERLAP`: `add(x[:-1], x[:-1], out=x[1:])` matches NumPy exactly.
- `dtype=` computes in the loop dtype (`subtract(300, 5, dtype=i16) = 295`), with the bool `add`→OR / `multiply`→AND remap keyed off the final loop dtype so `add(True, True, dtype=i32) = 2`.

6. Linear algebra
- `Vector256` FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (`np.dot(x.T, grad)`: 240 ms → ~1 ms) and the boxing `GetValue` fallback chain.
- `matmul` gufunc semantics: batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
- `np.multithreading`: opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.

7. Performance (beyond NpyIter and linalg)
- `sum(int16, axis=1)`: 1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bug
- `mean(axis)`; `var/std` 21×; `count_nonzero` 20×
- `np.nonzero`
- `np.where`
- `sqrt` reached parity via gather→tile→SIMD buffering
- `VirtualAlloc` + demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state; `GC.SuppressFinalize` on free; `using`/ARC adopted across `concatenate`, `allclose`, `convolve`, `tile`, `eye`, masking, shuffle, …
- `astype` matrix: `cvtt` float→int, Giesen f16↔ widen/narrow, complex deinterleave, sub-word VPSHUFB shuffles, fused VPGATHER whole-array kernels, single-pass KEEPORDER same-type copy. Cliffs eliminated: 716 → ~391 lagging cells, 852 → 1,177 winning cells vs NumPy
- `np.zeros`: `calloc` / Windows `VirtualAlloc` demand-zero, O(1) regardless of size (10M f64: 14.3 ms → ~0.01 ms, was ~1000× slower)
- `sum(broadcast_to(...))`: now ~534–700× faster, beats NumPy, bit-exact
- `sum`/`mean` (float): `np.add.reduce` bit-for-bit (unblocks float32)
- `np.any`/`np.all` (bool/char)
- `ForEach` axis reductions: Decimal 5–13×, Half mean 1.6–3.7×, Complex mean 15–45×→parity; float16 negate ~10× via sign-bit flip
- `float→int32`) `cvtt`, strided/reversed/gathered variants
- `np.split` family

8. Official benchmark suite + honest methodology
- `run_benchmark.py` entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x: 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
- `dotnet run` file-based apps compile the project reference in Debug (optimizations off) even with `Configuration=Release` properties: hand loops measured ~2× slow while DynamicMethod IL was immune. Benchmarks now assert `IsJITOptimizerDisabled == false` and refuse to mislead; the rule is documented.
- `run_benchmark.py`, plus a post-release CI workflow (`.github/workflows/benchmark.yml`) that auto-commits report cards to master.
- `np.sum` over a `broadcast_to` view (was 54× slower) folds stride-0 axes algebraically and runs ~534–700× faster than NumPy, bit-exact; scalar `np.any`/`np.all` on bool/char (was 5–12× slower) reinterpret onto the integer SIMD path; `np.zeros` (was ~1000× slower) is calloc-backed. Remaining tracked items: small-N (~1K) per-call dispatch overhead and a few iterator edge cases pinned as `[OpenBugs]`/skipped repros. A win surfaced too: hand-rolled 8-band parallel iteration 4.7×.

9. Differential fuzzing vs NumPy (new infrastructure)
- `.github/workflows/fuzz-soak.yml`).
- `docs/FUZZ_FINDINGS.md`; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).

10. Correctness: NumPy-parity bug fixes
Semantics (behavioral changes, may affect callers):
- `floor_divide`/`mod`: NumPy-exact floored semantics and divide-by-zero results.
- `<=`/`>=` now return `False` for NaN (IEEE/NumPy). `min`/`max` propagate NaN.
- `np.negative(uint)` wraps modulo 2ⁿ instead of throwing; `bool - bool` and `-bool`/`np.negative(bool)` now throw (NumPy behavior).
- `np.power`: negative integer exponent raises `ValueError`; exact integer-power semantics.
- `ConvertValue`); `complex→bool` no longer drops the imaginary part; `float→int` SIMD uses truncation (`cvtt`) like NumPy.
- `[1]` meets a lower-rank operand; quantile-family dtype & bool handling; Complex `np.where`.
- `reciprocal(0)` is per-width exact: `int32`/`int64` → `MinValue`, `uint64` → 2⁶³, but `0` for int8/int16/uint8/uint16/uint32 (was MinValue/0 across the board); `bool` → int8.
- `clip`/`maximum`/`minimum`: float16 signed-zero scalar tail, NaN propagation through the SIMD kernel, and correct F-contiguous/strided element pairing.
- `float16` axis sum accumulates in `float32` (NumPy parity); Complex flat `min`/`max` return the NaN-bearing element verbatim; Complex unary math ported from NumPy's own C99 algorithms.

Crashes & corruption:
- `COPY_IF_OVERLAP`, §1).
- `WRITEMASKED` write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect); it now writes back only mask-nonzero elements.
- `np.pad`: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.
- `UnmanagedStorage`/`ArraySlice`: `CopyTo` direction + bounds bugs; `CloneData` partial-buffer bug; scalar offset lost on `Clone`; buffered `NpyIter.Clone` shared buffers; `DTypeSize` reported `Marshal.SizeOf` instead of in-memory stride; `NPTypeCode.Char.SizeOf` returned 1 (real: 2); stale Decimal priority.
- `TensorEngine` now propagates through `Cast`/`Transpose`/`copy`/`reshape`/`ravel` (custom engines were silently dropped).
- `take` with `out=` enforces NumPy's safe-cast direction; `put`/`place` non-contiguous writeback fixes; `argsort` on non-C-contiguous input.
- `ForEach`/`ExecuteGeneric`/`ExecuteReducing` read past the end without `EXTERNAL_LOOP`.
- `np.exp2` float32-output IL kernel was malformed (`InvalidProgramException`); `np.power` with a Half exponent threw `InvalidCastException`; a narrowing `dtype=` on a complex float-ufunc segfaulted. All fixed.
- `nansum` axis reduction read uninitialized memory for `ndim ≥ 3`; the AVX2 32-lane `any()` mask overflow (byte/sbyte) returned wrong results; net8.0 complex `abs` and axis `min`/`max` NaN propagation corrected.

11. Memory management: ARC + `IDisposable`

- `NDArray` now implements `IDisposable` backed by atomic reference counting on the unmanaged block: CAS-driven `TryAddRef`/`Release`, idempotent `Dispose`, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
- `dot` at 100K: 446 collections → 0).

12. `Char8` primitive

New 1-byte character type (`NumSharp.Char8`), the NumPy `S1` / Python `bytes(1)` equivalent, with conversions, operators, span helpers, and 100% Python `bytes` API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under `src/dotnet/` document the upstream implementations it mirrors.

13. Examples: trainable MNIST MLP
New `examples/NeuralNetwork.NumSharp`: a 2-layer MLP with a naive implementation and a fused one (single-`NpyIter` bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before `np.dot`" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene
- `ILKernelGenerator` split into `DirectILKernelGenerator` (legacy whole-array kernels, 51 partials under `Kernels/Direct/`) and `ILKernelGenerator` (NpyIter-driven per-chunk kernels, the target model matching NumPy's `PyUFuncGenericFunction`); migration path documented per kernel family.
- `Vector128/256/512` and `Math`/`MathF` reflection centralized in `VectorMethodCache`/`ScalarMethodCache`; IL-emitted typed-field copier replaces the `UnmanagedStorage.Alias` switch.
- `[Obsolete(error: true)]` tombstones, referenced by nothing); dead axis-reduction SIMD paths removed.

15. Documentation
- `docs/website-src/docs/NDIter.md` (7-technique quick reference, decision tree, memory model, gotchas) + `ndarray.md`.
- `benchmarks.md` (head-to-head evidence companion to the IL-generation page), `benchmark-iterator.md`, `benchmark-matrix.md`, driven by the auto-committed report artifacts.
- `PERF_LEDGER.md` (every optimization with before/after), `NPYITER_GAPS_AND_ROADMAP.md` (gap analysis vs NumPy 2.4.2 + prioritized roadmap), `MIGRATE_NPYITER.md`, IL-kernel playbook, fuzz findings/coverage.
- `test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs`: every Tier-1 finding fixed or reproduced as an `[OpenBugs]` test.

16. Tests & CI
- `np.evaluate` (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, `out=` identity/cast/aliasing, fused reductions), `out=`/`where=`/`dtype=` parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
- `build-and-release.yml`, nightly `fuzz-soak.yml`, new post-release `benchmark.yml` (auto-commits NumPy-comparison report cards to master).
- `flip`/`fliplr`/`flipud`/`rot90`, `diag`, `gradient`, and `round` (`np.sort` is now done); small-N (~1K) per-call dispatch overhead is the headline performance focus (`docs/NPYITER_GAPS_AND_ROADMAP.md`); a few iterator edge cases remain pinned as `[OpenBugs]`/skipped repros. Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design test rather than ignored.

Breaking changes
- `bool - bool`, `-bool`, `np.negative(bool)` now throw: `^` / cast to int first
- `<=`/`>=` returns `False`: `np.isnan` explicitly
- `floor_divide`/`mod` divide-by-zero & floored results
- `np.negative(uint)` wraps instead of throwing
- `np.power(int, negative int)` raises `ValueError`
- `np.clip` / quantile-family dtype promotion
- `[1]`: `.copy()` to write
- `MultiIterator` and `NDIterator` (+ `NDIterator<T>`, `AsIterator`) removed: `NpyIter` / `NpyIter.Copy` / `NpyFlatIterator`
- `MaxOperands=8` and 64-dim limits removed
- `np.copyto` unwriteable-destination error type corrected

Everything above was validated against NumPy 2.4.2 ground truth: by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.
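
Three of the breaking changes above (floored `floor_divide`/`mod`, `np.negative(uint)` wrapping modulo 2ⁿ, and NaN `<=`/`>=` returning `False`) can be sketched in plain Python to show what callers should expect. This is an illustration of the semantics only, not NumSharp's implementation: the helper names `trunc_divmod`, `floor_divmod`, and `negate_unsigned` are hypothetical, and divide-by-zero results are not covered.

```python
def trunc_divmod(a, b):
    """C-style integer division: quotient rounds toward zero,
    remainder takes the sign of the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b


def floor_divmod(a, b):
    """Floored division: quotient rounds toward negative infinity,
    remainder takes the sign of the divisor."""
    q = a // b  # Python's // on ints already floors
    return q, a - q * b


def negate_unsigned(x, bits):
    """Negating an unsigned value wraps modulo 2**bits."""
    return (-x) % (1 << bits)


nan = float("nan")

print(trunc_divmod(-7, 2))      # (-3, -1)
print(floor_divmod(-7, 2))      # (-4, 1)
print(negate_unsigned(1, 8))    # 255
print(nan <= 1.0, nan >= 1.0)   # False False
```

Code that relied on truncating division for negative operands, or on an exception when negating an unsigned value, is the kind most likely to notice these changes.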